geno_lewm.action.apply

apply

Pure-Python apply_edit / apply_edits helpers.

These functions produce the post-edit window string used to encode the training target s_{t+1} (RFC-0003 §3.7). They are pure, importable without torch, and load-bearing for both training and eval.

The multi-edit form applies edits right-to-left, in descending rel_pos order (INV-ARCH-4), so that each remaining edit's relative position stays valid while earlier mutations change the window length. Overlapping edits raise geno_lewm.errors.OverlappingEditsError. A mismatch between the window and an edit's reference bases raises geno_lewm.errors.WindowMismatchError. Edits whose locus falls outside the window raise geno_lewm.errors.OutOfWindowError.
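The effect of the descending-order rule can be seen in a small sketch. The `RelEdit` dataclass and the `splice` helper below are illustrative stand-ins, not the package's own definitions; only the field names `rel_pos`, `ref_bases`, and `alt_bases` are taken from the source listing on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelEdit:
    """Illustrative stand-in carrying only the fields apply.py reads."""
    rel_pos: int
    ref_bases: str
    alt_bases: str


def splice(window: str, edit: RelEdit) -> str:
    """Replace ref_bases at rel_pos with alt_bases (no validation here)."""
    end = edit.rel_pos + len(edit.ref_bases)
    return window[: edit.rel_pos] + edit.alt_bases + window[end:]


window = "ACGTACGT"
deletion = RelEdit(rel_pos=1, ref_bases="CG", alt_bases="")  # shortens by 2
snv = RelEdit(rel_pos=5, ref_bases="C", alt_bases="T")

# Right-to-left: the SNV at position 5 is applied first, then the deletion
# at position 1. Nothing to the left of an edit has moved when it is applied.
right_to_left = splice(splice(window, snv), deletion)

# Left-to-right: the deletion shifts every later base two positions left,
# so position 5 no longer holds the base the SNV was defined against.
after_deletion = splice(window, deletion)
stale_base = after_deletion[snv.rel_pos : snv.rel_pos + 1]

print(right_to_left)                 # ATATGT
print(stale_base, snv.ref_bases)     # T C
```

Applied left-to-right, the second edit would fail the reference-bases check (or silently edit the wrong base if unchecked), which is the situation the descending sort avoids.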

Because indels change the sequence length, the post-edit window may be longer or shorter than the input. apply_edit and apply_edits accept an optional preserve_length argument (default False); when it is True, the result is truncated or padded on the side opposite the edit, which preserves maximum context around the locus.

apply_edit

apply_edit(window: str, edit: RelEdit, *, preserve_length: bool = False) -> str

Return window with edit applied.

window is the pre-edit base string (uppercase ACGTN). The function does not validate window contents beyond what the edit locus requires; that is the caller's responsibility.

The reference bases at the edit locus must match edit.ref_bases case-insensitively; otherwise WindowMismatchError is raised with the locus context attached.

Pass preserve_length=True to truncate or pad the result back to the original window length on the side opposite the edit. The default leaves any indel length change intact; restoring the length for s_{t+1} encoding is then the trainer's responsibility.
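The private helper `_truncate_or_pad` is not shown on this page, so the following is only a guess at the behaviour described above. It assumes that "the side opposite the edit" means the window end farther from the edit locus, and that padding uses the character `N`; both are assumptions, and the function name `truncate_or_pad` is hypothetical.

```python
def truncate_or_pad(
    edited: str, target_len: int, *, edit_locus: int, pad: str = "N"
) -> str:
    """Hypothetical sketch: restore target_len by adjusting the far end.

    Assumptions (not confirmed by the source): the adjusted end is the one
    farther from edit_locus, and padding uses 'N'.
    """
    delta = len(edited) - target_len
    if delta == 0:
        return edited

    # A locus in the left half means the right end is the "opposite" side.
    adjust_right = edit_locus < target_len / 2

    if delta > 0:
        # An insertion made the window longer: trim the far end.
        return edited[:target_len] if adjust_right else edited[delta:]

    # A deletion made the window shorter: pad the far end.
    padding = pad * (-delta)
    return edited + padding if adjust_right else padding + edited


# 3-base insertion near the left end of an 8-base window: trim on the right.
trimmed = truncate_or_pad("AGGGCGTACGT", 8, edit_locus=1)

# 2-base deletion near the right end of an 8-base window: pad on the left.
padded = truncate_or_pad("ACGTAT", 8, edit_locus=6)

print(trimmed)  # AGGGCGTA
print(padded)   # NNACGTAT
```

Adjusting the far end keeps as many bases as possible on both sides of the locus, which matches the stated goal of preserving context around the edit.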

Source code in geno_lewm/action/apply.py
def apply_edit(window: str, edit: RelEdit, *, preserve_length: bool = False) -> str:
    """Return ``window`` with ``edit`` applied.

    ``window`` is the pre-edit base string (uppercase ACGTN). The
    function does not validate window contents beyond what the edit
    locus requires; that is the caller's responsibility.

    The reference bases at the edit locus must match ``edit.ref_bases``
    case-insensitively — otherwise :class:`WindowMismatchError` is
    raised with the locus context attached.

    Pass ``preserve_length=True`` to truncate / pad the result back to
    the original window length on the side opposite the edit. The
    default leaves the indel length change intact (length-preserving
    is the trainer's responsibility for ``s_{t+1}`` encoding).
    """
    original_len = len(window)
    end = edit.rel_pos + len(edit.ref_bases)
    if edit.rel_pos < 0 or end > original_len:
        raise OutOfWindowError(
            "edit locus is outside the window",
            details={
                "rel_pos": edit.rel_pos,
                "ref_len": len(edit.ref_bases),
                "window_len": original_len,
            },
        )

    observed = window[edit.rel_pos : end]
    if observed.upper() != edit.ref_bases.upper():
        raise WindowMismatchError(
            "window bases do not match edit.ref_bases at locus",
            details={
                "rel_pos": edit.rel_pos,
                "expected_ref": edit.ref_bases,
                "observed_ref": observed,
            },
            remediation="re-fetch the window, or correct the EditSpec.ref",
        )

    edited = window[: edit.rel_pos] + edit.alt_bases + window[end:]

    if not preserve_length:
        return edited

    return _truncate_or_pad(edited, original_len, edit_locus=edit.rel_pos)

apply_edits

apply_edits(window: str, edits: Sequence[RelEdit], *, preserve_length: bool = False) -> str

Apply a sequence of edits to window.

The edits are sorted by descending rel_pos and applied in that order (INV-ARCH-4). Edits must not overlap in genomic coordinates; overlap raises OverlappingEditsError.

Equivalent inputs (the same set of edits in any caller-supplied order) produce equivalent outputs: the internal sort makes the function order-invariant, which is the property the training pipeline relies on.

The preserve_length flag truncates / pads back to the input window length using the position of the first (left-most) edit as the reference locus, so the side opposite the edit cluster is the one trimmed.
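The non-overlap requirement can be illustrated with a small interval check. This is a sketch of one way to detect overlap, not the package's `_assert_disjoint` (whose body is not shown here). `RelEdit` is again an illustrative stand-in and `find_overlap` is a hypothetical name; each edit is assumed to occupy the half-open interval from `rel_pos` to `rel_pos + len(ref_bases)`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class RelEdit:
    """Illustrative stand-in carrying only the fields apply.py reads."""
    rel_pos: int
    ref_bases: str
    alt_bases: str


def find_overlap(
    edits: Sequence[RelEdit],
) -> Optional[Tuple[RelEdit, RelEdit]]:
    """Return the first overlapping pair of edits, or None if disjoint.

    Assumption: an edit covers [rel_pos, rel_pos + len(ref_bases)). Pure
    insertions (empty ref_bases) cover nothing under this model; how the
    real helper treats them is not stated in the source.
    """
    ordered = sorted(edits, key=lambda e: e.rel_pos)
    for left, right in zip(ordered, ordered[1:]):
        if left.rel_pos + len(left.ref_bases) > right.rel_pos:
            return left, right
    return None


a = RelEdit(rel_pos=2, ref_bases="GT", alt_bases="G")  # covers [2, 4)
b = RelEdit(rel_pos=3, ref_bases="T", alt_bases="C")   # covers [3, 4)
c = RelEdit(rel_pos=4, ref_bases="A", alt_bases="T")   # covers [4, 5)

print(find_overlap([b, a]))  # a and b share position 3
print(find_overlap([c, a]))  # None: the intervals only touch at 4
```

Sorting before comparing neighbours means the caller's ordering does not affect the result, in line with the order-invariance described above.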

Source code in geno_lewm/action/apply.py
def apply_edits(
    window: str,
    edits: Sequence[RelEdit],
    *,
    preserve_length: bool = False,
) -> str:
    """Apply a sequence of edits to ``window``.

    The edits are sorted by descending ``rel_pos`` and applied in that
    order (INV-ARCH-4). Edits must not overlap in genomic coordinates;
    overlap raises :class:`OverlappingEditsError`.

    Equivalent inputs (same set of edits in any caller-supplied order)
    produce equivalent outputs — the function is order-invariant after
    the internal sort, which is the property the training pipeline
    relies on.

    The ``preserve_length`` flag truncates / pads back to the input
    window length using the position of the **first** (left-most)
    edit as the reference locus, so the side opposite the edit cluster
    is the one trimmed.
    """
    if not edits:
        return window

    _assert_disjoint(edits)

    # Apply right-to-left. The inner calls use preserve_length=False so
    # that truncation / padding happens only once, at the end
    # (intermediate lengths change with indels, which is fine).
    ordered = sorted(edits, key=lambda e: e.rel_pos, reverse=True)
    out = window
    for edit in ordered:
        out = apply_edit(out, edit, preserve_length=False)

    if not preserve_length:
        return out

    leftmost = min(e.rel_pos for e in edits)
    return _truncate_or_pad(out, len(window), edit_locus=leftmost)