Pure-Python apply_edit / apply_edits helpers.

These functions produce the post-edit window string used to encode the
training target ``s_{t+1}`` (RFC-0003 §3.7). They are pure, importable
without torch, and load-bearing for both training and eval.

The multi-edit form applies edits right-to-left by descending
``rel_pos`` (INV-ARCH-4) so each edit's relative position stays valid
through the sequence of mutations. Overlapping edits raise
:class:`geno_lewm.errors.OverlappingEditsError`. A reference-base
mismatch raises :class:`geno_lewm.errors.WindowMismatchError`. Edits
whose locus is outside the window raise
:class:`geno_lewm.errors.OutOfWindowError`.

Applying edits can change the window length (indels).
:func:`apply_edit` and :func:`apply_edits` accept an optional
``preserve_length=True`` argument that truncates or pads on the side
opposite the edit to preserve maximum context around the locus.
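The right-to-left rule is easiest to see with two edits where the left one is an insertion. The sketch below is illustrative only: it uses plain ``(rel_pos, ref, alt)`` tuples and a hypothetical ``splice`` helper instead of the package's ``RelEdit`` type and error classes.

```python
# Illustrative only: `splice` and the tuple layout are stand-ins, not the geno_lewm API.

def splice(window: str, rel_pos: int, ref: str, alt: str) -> str:
    """Replace `ref` at `rel_pos` with `alt`, checking the reference bases first."""
    end = rel_pos + len(ref)
    if window[rel_pos:end] != ref:
        raise ValueError(f"expected {ref!r} at {rel_pos}, found {window[rel_pos:end]!r}")
    return window[:rel_pos] + alt + window[end:]


window = "ACGTACGT"
edits = [
    (1, "C", "CTT"),  # insertion: grows the window by two bases
    (6, "G", "A"),    # substitution further to the right
]

# Right-to-left: the edit at 6 is applied first, so the later insertion at 1
# cannot shift it.
right_to_left = window
for rel_pos, ref, alt in sorted(edits, key=lambda e: e[0], reverse=True):
    right_to_left = splice(right_to_left, rel_pos, ref, alt)

# Left-to-right: the insertion at 1 moves the original base 6 to index 8, so
# the second edit no longer finds its reference base at index 6.
left_to_right_failed = False
try:
    out = window
    for rel_pos, ref, alt in sorted(edits, key=lambda e: e[0]):
        out = splice(out, rel_pos, ref, alt)
except ValueError:
    left_to_right_failed = True

print(right_to_left)         # ACTTGTACAT
print(left_to_right_failed)  # True
```

Applying edits in ascending order would require re-computing every later ``rel_pos`` after each indel; descending order avoids that bookkeeping entirely.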
apply_edit
apply_edit(window: str, edit: RelEdit, *, preserve_length: bool = False) -> str
Return ``window`` with ``edit`` applied.

``window`` is the pre-edit base string (uppercase ACGTN). The
function does not validate window contents beyond what the edit
locus requires; that is the caller's responsibility.

The reference bases at the edit locus must match ``edit.ref_bases``
case-insensitively; otherwise :class:`WindowMismatchError` is
raised with the locus context attached.

Pass ``preserve_length=True`` to truncate or pad the result back to
the original window length on the side opposite the edit. The
default leaves the indel length change intact (preserving length
is the trainer's responsibility when encoding ``s_{t+1}``).
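As a worked illustration of the default behaviour, the sketch below applies a substitution, an insertion, and a deletion to the same window. ``RelEditSketch`` and ``apply_edit_sketch`` are hypothetical stand-ins: the real ``RelEdit`` constructor is not shown on this page, and the package's error classes are reduced to ``ValueError`` here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelEditSketch:
    """Hypothetical stand-in exposing only the three fields the docs mention."""
    rel_pos: int
    ref_bases: str
    alt_bases: str


def apply_edit_sketch(window: str, edit: RelEditSketch) -> str:
    """Splice `alt_bases` over `ref_bases`; error handling is reduced to ValueError."""
    end = edit.rel_pos + len(edit.ref_bases)
    if edit.rel_pos < 0 or end > len(window):
        raise ValueError("edit locus is outside the window")
    if window[edit.rel_pos:end].upper() != edit.ref_bases.upper():
        raise ValueError("window bases do not match ref_bases")
    return window[:edit.rel_pos] + edit.alt_bases + window[end:]


window = "ACGTACGT"

snv = apply_edit_sketch(window, RelEditSketch(2, "G", "T"))
insertion = apply_edit_sketch(window, RelEditSketch(2, "G", "GAA"))
deletion = apply_edit_sketch(window, RelEditSketch(2, "GTA", "G"))
lower_ref = apply_edit_sketch(window, RelEditSketch(2, "g", "T"))

print(snv)        # ACTTACGT    (length unchanged)
print(insertion)  # ACGAATACGT  (two bases longer)
print(deletion)   # ACGCGT      (two bases shorter)
print(lower_ref)  # ACTTACGT    (reference match ignores case)
```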
Source code in geno_lewm/action/apply.py
def apply_edit(window: str, edit: RelEdit, *, preserve_length: bool = False) -> str:
    """Return ``window`` with ``edit`` applied.

    ``window`` is the pre-edit base string (uppercase ACGTN). The
    function does not validate window contents beyond what the edit
    locus requires; that is the caller's responsibility.

    The reference bases at the edit locus must match ``edit.ref_bases``
    case-insensitively; otherwise :class:`WindowMismatchError` is
    raised with the locus context attached.

    Pass ``preserve_length=True`` to truncate or pad the result back to
    the original window length on the side opposite the edit. The
    default leaves the indel length change intact (preserving length
    is the trainer's responsibility when encoding ``s_{t+1}``).
    """
    original_len = len(window)
    # Half-open reference span [rel_pos, end) that the edit replaces.
    end = edit.rel_pos + len(edit.ref_bases)
    if edit.rel_pos < 0 or end > original_len:
        raise OutOfWindowError(
            "edit locus is outside the window",
            details={
                "rel_pos": edit.rel_pos,
                "ref_len": len(edit.ref_bases),
                "window_len": original_len,
            },
        )
    # Compare the window bases with the expected reference, ignoring case.
    observed = window[edit.rel_pos : end]
    if observed.upper() != edit.ref_bases.upper():
        raise WindowMismatchError(
            "window bases do not match edit.ref_bases at locus",
            details={
                "rel_pos": edit.rel_pos,
                "expected_ref": edit.ref_bases,
                "observed_ref": observed,
            },
            remediation="re-fetch the window, or correct the EditSpec.ref",
        )
    # Splice the alternate bases over the reference span.
    edited = window[: edit.rel_pos] + edit.alt_bases + window[end:]
    if not preserve_length:
        return edited
    return _truncate_or_pad(edited, original_len, edit_locus=edit.rel_pos)
apply_edits
apply_edits(window: str, edits: Sequence[RelEdit], *, preserve_length: bool = False) -> str
Apply a sequence of edits to ``window``.

The edits are sorted by descending ``rel_pos`` and applied in that
order (INV-ARCH-4). Edits must not overlap in genomic coordinates;
overlap raises :class:`OverlappingEditsError`.

The same set of edits produces the same output regardless of the
order in which the caller supplies them: the internal sort makes the
function order-invariant, which is the property the training
pipeline relies on.

The ``preserve_length`` flag truncates or pads back to the input
window length using the position of the **first** (left-most)
edit as the reference locus, so the side opposite the edit cluster
is the one trimmed or padded.
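The disjointness check and the order-invariance property can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's ``_assert_disjoint``: it assumes that "overlap" means intersecting reference spans ``[rel_pos, rel_pos + len(ref_bases))``, and ``RelEditSketch``, ``OverlapSketchError``, ``assert_disjoint_sketch`` and ``apply_edits_sketch`` are hypothetical names. The reference-base check is omitted for brevity.

```python
from dataclasses import dataclass
from itertools import permutations


class OverlapSketchError(ValueError):
    """Stand-in for geno_lewm.errors.OverlappingEditsError."""


@dataclass(frozen=True)
class RelEditSketch:
    """Hypothetical stand-in with the three fields used by the source listing."""
    rel_pos: int
    ref_bases: str
    alt_bases: str


def assert_disjoint_sketch(edits):
    """Raise if any two reference spans intersect (assumed overlap definition)."""
    ordered = sorted(edits, key=lambda e: e.rel_pos)
    # After sorting by start, checking neighbours is enough to catch any overlap.
    for left, right in zip(ordered, ordered[1:]):
        if left.rel_pos + len(left.ref_bases) > right.rel_pos:
            raise OverlapSketchError(
                f"edits at {left.rel_pos} and {right.rel_pos} overlap"
            )


def apply_edits_sketch(window, edits):
    """Apply disjoint edits right-to-left; no reference check in this sketch."""
    assert_disjoint_sketch(edits)
    out = window
    for e in sorted(edits, key=lambda e: e.rel_pos, reverse=True):
        end = e.rel_pos + len(e.ref_bases)
        out = out[:e.rel_pos] + e.alt_bases + out[end:]
    return out


edits = [
    RelEditSketch(1, "C", "CTT"),  # insertion
    RelEditSketch(4, "AC", "A"),   # deletion
    RelEditSketch(7, "T", "G"),    # substitution
]

# Every caller-supplied ordering collapses to a single result.
results = {apply_edits_sketch("ACGTACGT", list(p)) for p in permutations(edits)}
print(results)  # {'ACTTGTAGG'}

# Two edits whose reference spans share base 5 are rejected.
overlap_detected = False
try:
    assert_disjoint_sketch([RelEditSketch(4, "AC", "A"), RelEditSketch(5, "C", "G")])
except OverlapSketchError:
    overlap_detected = True
print(overlap_detected)  # True
```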
Source code in geno_lewm/action/apply.py
def apply_edits(
    window: str,
    edits: Sequence[RelEdit],
    *,
    preserve_length: bool = False,
) -> str:
    """Apply a sequence of edits to ``window``.

    The edits are sorted by descending ``rel_pos`` and applied in that
    order (INV-ARCH-4). Edits must not overlap in genomic coordinates;
    overlap raises :class:`OverlappingEditsError`.

    The same set of edits produces the same output regardless of the
    order in which the caller supplies them: the internal sort makes the
    function order-invariant, which is the property the training
    pipeline relies on.

    The ``preserve_length`` flag truncates or pads back to the input
    window length using the position of the **first** (left-most)
    edit as the reference locus, so the side opposite the edit cluster
    is the one trimmed or padded.
    """
    if not edits:
        return window
    _assert_disjoint(edits)
    # Apply right-to-left, passing preserve_length=False to the inner
    # calls so that we truncate only once at the end (intermediate
    # lengths change with indels, which is fine).
    ordered = sorted(edits, key=lambda e: e.rel_pos, reverse=True)
    out = window
    for edit in ordered:
        out = apply_edit(out, edit, preserve_length=False)
    if not preserve_length:
        return out
    # Use the left-most edit as the reference locus for the final resize.
    leftmost = min(e.rel_pos for e in edits)
    return _truncate_or_pad(out, len(window), edit_locus=leftmost)
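Both functions delegate length restoration to ``_truncate_or_pad``, whose body is not shown on this page. The sketch below is one plausible reading of "truncate or pad on the side opposite the edit" and should not be taken as the actual helper. Its assumptions are loud ones: the name ``truncate_or_pad_sketch`` is hypothetical, the pad character is assumed to be ``N``, and "opposite side" is assumed to mean the window end farther from the locus, with a locus exactly at the midpoint treated as belonging to the right half.

```python
def truncate_or_pad_sketch(
    seq: str, target_len: int, *, edit_locus: int, pad: str = "N"
) -> str:
    """Restore `seq` to `target_len`, changing only the end farther from `edit_locus`.

    Assumed behaviour, not the geno_lewm implementation.
    """
    delta = len(seq) - target_len
    if delta == 0:
        return seq
    # Assumption: a locus in the left half means the right end is "opposite".
    locus_in_left_half = edit_locus < target_len / 2
    if delta > 0:
        # Too long (net insertion): drop bases from the far end.
        return seq[:target_len] if locus_in_left_half else seq[delta:]
    # Too short (net deletion): pad the far end with the assumed pad base.
    padding = pad * (-delta)
    return seq + padding if locus_in_left_half else padding + seq


# Insertion near the left edge of an 8-base window: trim on the right.
print(truncate_or_pad_sketch("ACGAATACGT", 8, edit_locus=2))  # ACGAATAC

# Deletion near the left edge: pad on the right.
print(truncate_or_pad_sketch("ACGCGT", 8, edit_locus=2))      # ACGCGTNN

# Insertion near the right edge: trim on the left instead.
print(truncate_or_pad_sketch("ACGTACGAAT", 8, edit_locus=6))  # GTACGAAT
```

Changing only the far end keeps the bases immediately around the locus inside the window, which matches the stated goal of preserving maximum context around the edit.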