geno_lewm.action
action
Action representation for GenoLeWM.
Public surface defined by RFC-0003. The package ships the canonical edit types, pure-Python apply functions, synthetic samplers, and the optional PyTorch action encoder.
ActionEncoder
ActionEncoder(*, d_action: int = 512, d_pos: int = 128, d_type: int = 64, d_seq: int = 256, max_window_bp: int = 12288, carbon_tokenizer: Any | None = None)
Bases: Module
Encode RelEdit objects into learned action embeddings.
Source code in geno_lewm/action/encoder.py
EditSpec
dataclass
A canonical, frozen genomic edit (RFC-0003 §3.1).
Construct with absolute VCF-style coordinates; the derived
edit_type attribute is filled in by __post_init__.
pos is 1-based per VCF convention; both ref and alt are
explicit base strings (symbolic alleles such as <DEL> and <INS>
are not supported in v1 and are deferred to v2).
relative_to
Return the window-relative form (RFC-0003 §3.3).
window_start_bp and window_end_bp are 0-based inclusive
coordinates on the same chromosome as the edit's chrom attribute.
The predictor sees only the relative offset; absolute coordinates
never enter the model.
Source code in geno_lewm/action/spec.py
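The coordinate conversion described above can be sketched as follows. This is a minimal illustration, not the package's implementation: relative_offset is a hypothetical helper, and it assumes the relative offset is simply the 0-based position minus window_start_bp.

```python
def relative_offset(pos_1based: int, window_start_bp: int, window_end_bp: int) -> int:
    """Hypothetical helper: map a 1-based VCF position to a 0-based window offset.

    Assumes window bounds are 0-based and inclusive, as stated in the docstring.
    """
    pos_0based = pos_1based - 1  # VCF positions are 1-based
    if not (window_start_bp <= pos_0based <= window_end_bp):
        raise ValueError("edit position lies outside the window")
    return pos_0based - window_start_bp


# A variant at VCF position 1001 in a window covering 0-based [1000, 1009]
# sits on the first base of the window.
print(relative_offset(1001, 1000, 1009))  # 0
```

Mixing the 1-based pos with the 0-based window bounds is an easy source of off-by-one errors, which is why the sketch converts to 0-based before comparing.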
EditType
Bases: IntEnum
The six v1 edit categories (RFC-0003 §3.2).
Members are deterministic functions of (len(ref), len(alt));
callers do not pass this value, because it is computed during construction.
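To illustrate how a category can be derived from allele lengths alone, here is a rough sketch. It covers only the four categories named elsewhere on this page (SNV, MNV, INS, DEL) and assumes VCF-style anchored indels; the remaining v1 members are not listed here, so they are lumped into a placeholder "OTHER". The function name and string labels are hypothetical and do not reflect the real enum members or values.

```python
def classify(ref: str, alt: str) -> str:
    """Hypothetical classifier keyed only on (len(ref), len(alt))."""
    n_ref, n_alt = len(ref), len(alt)
    if n_ref == 0 or n_alt == 0:
        raise ValueError("ref and alt must be explicit, non-empty base strings")
    if n_ref == 1 and n_alt == 1:
        return "SNV"  # single-base substitution
    if n_ref == n_alt:
        return "MNV"  # length-preserving multi-base substitution
    if n_ref == 1:
        return "INS"  # anchor base followed by inserted bases
    if n_alt == 1:
        return "DEL"  # anchor base kept, following bases removed
    return "OTHER"    # placeholder for categories not described on this page


print(classify("A", "T"))    # SNV
print(classify("A", "ATT"))  # INS
print(classify("ATT", "A"))  # DEL
```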
RelEdit
dataclass
Window-relative form consumed by the action encoder.
apply_edit
Return window with edit applied.
window is the pre-edit base string (uppercase ACGTN). The
function does not validate window contents beyond what the edit
locus requires; that is the caller's responsibility.
The reference bases at the edit locus must match edit.ref_bases
case-insensitively; otherwise WindowMismatchError is
raised with the locus context attached.
Pass preserve_length=True to truncate / pad the result back to
the original window length on the side opposite the edit. The
default leaves the indel length change intact; restoring a fixed
length for s_{t+1} encoding is then the trainer's responsibility.
Source code in geno_lewm/action/apply.py
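The reference check and substitution described above can be sketched in plain Python. This is an illustration under assumptions, not the package's code: apply_one and WindowMismatch are hypothetical stand-ins, the edit is passed as loose arguments rather than a RelEdit, and preserve_length handling is left out.

```python
class WindowMismatch(ValueError):
    """Stand-in for the package's WindowMismatchError (hypothetical)."""


def apply_one(window: str, rel_pos: int, ref: str, alt: str) -> str:
    """Replace ref with alt at a 0-based window offset, after checking ref."""
    end = rel_pos + len(ref)
    if rel_pos < 0 or end > len(window):
        raise ValueError("edit locus falls outside the window")
    found = window[rel_pos:end]
    # Case-insensitive comparison, as the docstring describes.
    if found.upper() != ref.upper():
        raise WindowMismatch(f"expected {ref!r} at offset {rel_pos}, found {found!r}")
    return window[:rel_pos] + alt + window[end:]


window = "ACGTACGT"
print(apply_one(window, 2, "G", "GTT"))  # ACGTTTACGT (insertion, 2 bases longer)
print(apply_one(window, 2, "GT", "G"))   # ACGACGT (deletion, 1 base shorter)
```

Both calls change the window length, which corresponds to the default behaviour described above.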
apply_edits
Apply a sequence of edits to window.
The edits are sorted by descending rel_pos and applied in that
order (INV-ARCH-4). Edits must not overlap in genomic coordinates;
overlap raises OverlappingEditsError.
Equivalent inputs (the same set of edits in any caller-supplied order) produce equivalent outputs: the internal sort makes the function order-invariant, which is the property the training pipeline relies on.
The preserve_length flag truncates / pads back to the input
window length using the position of the first (left-most)
edit as the reference locus, so the side opposite the edit cluster
is the one trimmed.
Source code in geno_lewm/action/apply.py
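A short sketch of why the descending sort matters: applying the right-most edit first means that length changes never shift the offsets of edits still waiting to be applied. The code below is illustrative only. Edits are modelled as hypothetical (rel_pos, ref, alt) tuples instead of RelEdit, a plain ValueError stands in for OverlappingEditsError, and preserve_length is omitted.

```python
from typing import Iterable, Tuple

Edit = Tuple[int, str, str]  # (rel_pos, ref, alt); hypothetical stand-in for RelEdit


def apply_all(window: str, edits: Iterable[Edit]) -> str:
    """Apply non-overlapping edits right to left so earlier offsets stay valid."""
    ordered = sorted(edits, key=lambda e: e[0], reverse=True)

    # In descending order, each edit must end at or before the start
    # of the edit that precedes it in the list.
    prev_start = None
    for pos, ref, _alt in ordered:
        if prev_start is not None and pos + len(ref) > prev_start:
            raise ValueError("overlapping edits")
        prev_start = pos

    out = window
    for pos, ref, alt in ordered:
        if out[pos:pos + len(ref)].upper() != ref.upper():
            raise ValueError(f"reference mismatch at offset {pos}")
        out = out[:pos] + alt + out[pos + len(ref):]
    return out


edits = [(1, "C", "CAA"), (5, "CG", "C")]
print(apply_all("ACGTACGT", edits))                  # ACAAGTACT
print(apply_all("ACGTACGT", list(reversed(edits))))  # ACAAGTACT
```

Both calls print the same string, which is the order-invariance property in miniature.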
indel
indel(window: str, n: int, *, rng: Random, length_dist: Mapping[int, float] | Sequence[float] | None = None, type_mix: tuple[float, float] = (0.5, 0.5), edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n indels (INS or DEL).
length_dist is the distribution over event lengths (the number of
bases inserted or deleted, excluding the VCF anchor base). The default
is a truncated geometric distribution over [1, V1_MAX_LEN-1].
type_mix is (p_ins, p_del); the default is (0.5, 0.5).
Source code in geno_lewm/action/synthetic.py
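For intuition about the default length distribution, here is a small sketch of a truncated geometric distribution and a draw of one event type and length. The helper names, the max_len argument, and the success probability p = 0.5 are all assumptions for illustration; the package's actual parameterisation and the value of V1_MAX_LEN are not given on this page.

```python
import random
from typing import Dict, Tuple


def truncated_geometric_weights(max_len: int, p: float = 0.5) -> Dict[int, float]:
    """Geometric weights on lengths 1..max_len, renormalised to sum to 1."""
    raw = {k: (1 - p) ** (k - 1) * p for k in range(1, max_len + 1)}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}


def sample_event(
    rng: random.Random,
    weights: Dict[int, float],
    type_mix: Tuple[float, float] = (0.5, 0.5),
) -> Tuple[str, int]:
    """Draw one (kind, length) pair; type_mix is (p_ins, p_del)."""
    lengths = list(weights)
    length = rng.choices(lengths, weights=[weights[k] for k in lengths], k=1)[0]
    kind = "INS" if rng.random() < type_mix[0] else "DEL"
    return kind, length


rng = random.Random(0)  # seeded for reproducibility
weights = truncated_geometric_weights(max_len=5)
print(weights)
print([sample_event(rng, weights) for _ in range(5)])
```

Short events dominate under this distribution, since each additional base halves the weight when p = 0.5.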
mnv
mnv(window: str, n: int, *, rng: Random, length_dist: Mapping[int, float] | Sequence[float] | None = None, edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n MNVs (length-preserving multi-base substitutions).
Length is drawn from length_dist (default: uniform over [2, 8],
per the RFC text). The alt is guaranteed to differ from ref at every
base; otherwise constructing a RelEdit with that ref/alt would be
rejected by EditSpec validation.
Source code in geno_lewm/action/synthetic.py
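One way to satisfy the "differs at every base" guarantee is to pick each alt base from the bases that are not the reference base at that position. The sketch below shows the idea; mnv_alt is a hypothetical helper, not the sampler's actual code.

```python
import random

BASES = "ACGT"


def mnv_alt(ref: str, rng: random.Random) -> str:
    """Build an alt of the same length that differs from ref at every position."""
    return "".join(
        rng.choice([b for b in BASES if b != base])  # never the reference base
        for base in ref.upper()
    )


rng = random.Random(7)  # seeded for reproducibility
ref = "ACGT"
alt = mnv_alt(ref, rng)
print(ref, alt)
print(all(r != a for r, a in zip(ref, alt)))  # True
```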
uniform_snv
uniform_snv(window: str, n: int, *, rng: Random, edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n uniform SNVs anchored inside window.
Each SNV's alt is uniformly drawn from the three non-reference
bases at the chosen position, so the contract "alt is always
non-reference" is enforced by construction.
Returns edits in the order they were sampled. The list may contain duplicates by position; the caller (data pipeline) is responsible for deduplication if it needs disjoint edits.
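Because duplicates by position are possible, a caller that needs disjoint edits has to filter the result. The sketch below mimics the sampling with hypothetical (rel_pos, ref, alt) tuples and then keeps the first edit seen at each position. Both helpers are illustrative, and the edge_margin behaviour shown (excluding that many bases at each end of the window) is an assumption.

```python
import random
from typing import List, Tuple

Snv = Tuple[int, str, str]  # (rel_pos, ref, alt); hypothetical stand-in for RelEdit


def sample_snvs(window: str, n: int, rng: random.Random, edge_margin: int = 0) -> List[Snv]:
    """Sample n SNVs; positions may repeat, mirroring the documented behaviour."""
    edits = []
    for _ in range(n):
        pos = rng.randrange(edge_margin, len(window) - edge_margin)
        ref = window[pos]
        alt = rng.choice([b for b in "ACGT" if b != ref])  # non-reference by construction
        edits.append((pos, ref, alt))
    return edits


def dedupe_by_position(edits: List[Snv]) -> List[Snv]:
    """Keep the first edit at each position, preserving sampling order."""
    seen = set()
    kept = []
    for edit in edits:
        if edit[0] not in seen:
            seen.add(edit[0])
            kept.append(edit)
    return kept


rng = random.Random(42)  # seeded for reproducibility
sampled = sample_snvs("ACGTACGTACGT", n=20, rng=rng, edge_margin=2)
unique = dedupe_by_position(sampled)
print(len(sampled), len(unique))
```

Keeping the first occurrence is only one policy; a pipeline could equally resample until it has n disjoint positions.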