geno_lewm.encoder.windowing
windowing
Window extraction and Carbon tokenizer wrapping.
Defined by RFC-0002 §3.2 and the cache invariant INV-DATA-2. The
helpers in this module are deliberately pure Python: they canonicalize
DNA windows, extract fixed-size windows around an optional edit locus,
right-pad short source sequences with A, and produce the
<dna>...</dna> string consumed by Carbon's tokenizer.
ExtractedWindow
dataclass
ExtractedWindow(sequence: str, start_bp: int, end_bp: int, window_bp: int, edit_locus: int | None = None, relative_edit_locus: int | None = None, pad_right_bp: int = 0)
A fixed-size DNA window plus its source-coordinate metadata.
start_bp and end_bp are 0-based half-open coordinates in
the caller's source coordinate system. end_bp - start_bp always
equals window_bp even when the sequence had to be right-padded
past the available source bases; pad_right_bp records how many
trailing A bases were introduced.
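The coordinate invariant can be illustrated with a small stand-in dataclass. This is a sketch that copies only the documented field names; WindowSketch, source_bases, and the 8 bp width are hypothetical and chosen for readability, not taken from the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowSketch:
    """Illustrative stand-in mirroring ExtractedWindow's documented fields."""

    sequence: str
    start_bp: int
    end_bp: int
    window_bp: int
    edit_locus: Optional[int] = None
    relative_edit_locus: Optional[int] = None
    pad_right_bp: int = 0

    def source_bases(self) -> str:
        """Hypothetical helper: the window with trailing padding removed."""
        return self.sequence[: self.window_bp - self.pad_right_bp]


# A 6 bp source "ACGTAC" placed in an 8 bp window (width is illustrative only).
window = WindowSketch(
    sequence="ACGTACAA",
    start_bp=0,
    end_bp=8,
    window_bp=8,
    edit_locus=2,
    relative_edit_locus=2,  # assumed here to be edit_locus - start_bp
    pad_right_bp=2,
)

# The half-open span always equals window_bp, even though only 6 bases are real.
assert window.end_bp - window.start_bp == window.window_bp
assert window.source_bases() == "ACGTAC"
```

Because end_bp can exceed the available source length, a caller that maps window positions back to source coordinates would likely need to subtract pad_right_bp first.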
as_tokenizer_input
canonicalize_dna
Return uppercase DNA after validating the supported alphabet.
The cache hash invariant is based on uppercased window content, so
callers can hash raw source slices and already-canonical windows
interchangeably. N is accepted because reference FASTA and
edited windows may contain masked bases.
Source code in geno_lewm/encoder/windowing.py
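The following is not the module's source but a minimal sketch of the behavior described above. It assumes the supported alphabet is A, C, G, T, and N and that invalid input raises ValueError; both are assumptions, as is the name canonicalize_dna_sketch.

```python
# Assumed alphabet: the four bases plus N for masked positions.
_ALLOWED_BASES = frozenset("ACGTN")


def canonicalize_dna_sketch(sequence: str) -> str:
    """Uppercase the sequence and reject characters outside the assumed alphabet."""
    upper = sequence.upper()
    unexpected = set(upper) - _ALLOWED_BASES
    if unexpected:
        raise ValueError(f"unsupported bases: {sorted(unexpected)}")
    return upper


# Soft-masked (lowercase) input and canonical input map to the same string.
assert canonicalize_dna_sketch("acgtNn") == "ACGTNN"
assert canonicalize_dna_sketch("ACGTNN") == "ACGTNN"
```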
window_sha256
Return SHA-256 bytes for the canonicalized DNA sequence.
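As a rough illustration of why raw and canonical inputs can be hashed interchangeably, the sketch below uppercases before hashing with the standard library's hashlib. The ASCII encoding and the name window_sha256_sketch are assumptions; alphabet validation is omitted for brevity.

```python
import hashlib


def window_sha256_sketch(sequence: str) -> bytes:
    """Hash the uppercased sequence; returns the raw 32-byte digest."""
    canonical = sequence.upper()
    # Encoding as ASCII is an assumption about how the bytes are derived.
    return hashlib.sha256(canonical.encode("ascii")).digest()


raw_digest = window_sha256_sketch("acgtn")
canonical_digest = window_sha256_sketch("ACGTN")

# Case differences in the source slice do not change the cache key.
assert raw_digest == canonical_digest
assert len(raw_digest) == 32
```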
extract_window
extract_window(source_sequence: str, *, edit_locus: int | None = None, window_bp: int = DEFAULT_WINDOW_BP, assume_canonical: bool = False) -> ExtractedWindow
Extract a supported-width DNA window from source_sequence.
edit_locus is a 0-based offset in source_sequence. When it
is supplied the window is centered on that locus unless clamped by
source boundaries. When omitted, the source midpoint is used. If
the source is shorter than the requested window or the selected
interval extends past the right edge, trailing A bases are
appended per Carbon's tokenizer convention.
Set assume_canonical to True when source_sequence is already uppercase,
validated DNA (e.g. a contig from a loaded reference FASTA) to skip the
O(len) re-validation. Otherwise, re-validating a whole chromosome once per
variant dominates VCF scoring wall-clock time.
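The centering, clamping, and padding rules can be sketched as follows. This is one possible reading, not the library's implementation: it assumes the window start is shifted so the window stays inside the source whenever the source is long enough, so padding occurs only for short sources. The real function may instead pad at the right edge. The names extract_window_sketch and WindowResult and the 8 bp default width are hypothetical.

```python
from typing import NamedTuple, Optional


class WindowResult(NamedTuple):
    sequence: str
    start_bp: int
    end_bp: int
    relative_center: int
    pad_right_bp: int


def extract_window_sketch(
    source: str, edit_locus: Optional[int] = None, window_bp: int = 8
) -> WindowResult:
    """Center a fixed-size window on edit_locus (or the source midpoint)."""
    n = len(source)
    if edit_locus is not None and not 0 <= edit_locus < n:
        raise ValueError("edit_locus must be a 0-based offset inside source")
    center = n // 2 if edit_locus is None else edit_locus
    # Ideal start puts the center in the middle of the window.
    start = center - window_bp // 2
    # Assumed clamping: keep the window inside the source when possible.
    start = max(0, min(start, n - window_bp))
    available = source[start : start + window_bp]
    # Any shortfall is filled with trailing A bases.
    pad = window_bp - len(available)
    return WindowResult(
        available + "A" * pad, start, start + window_bp, center - start, pad
    )


source = "ACGTACGTACGTACGT"  # 16 bp
assert extract_window_sketch(source, 10) == ("GTACGTAC", 6, 14, 4, 0)
assert extract_window_sketch(source, 1) == ("ACGTACGT", 0, 8, 1, 0)  # left clamp
assert extract_window_sketch("CGT") == ("CGTAAAAA", 0, 8, 1, 5)  # short source
```

When many variants are scored against one contig, validating the contig once and passing assume_canonical on every call avoids repeating the O(len) alphabet check that this sketch leaves out.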
pad_for_carbon_tokenizer
Right-pad canonical DNA to Carbon's token multiple.
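A minimal sketch of padding to a multiple, assuming the pad base is A as described for this module. Carbon's actual token multiple is not stated here, so it is passed in as a hypothetical multiple parameter; the value 4 below is a placeholder, and pad_to_multiple_sketch is not the library function.

```python
def pad_to_multiple_sketch(sequence: str, multiple: int, pad_base: str = "A") -> str:
    """Right-pad sequence so its length is a whole number of `multiple`-sized units."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    remainder = len(sequence) % multiple
    if remainder == 0:
        return sequence  # already aligned, nothing to add
    return sequence + pad_base * (multiple - remainder)


# With a placeholder multiple of 4, a 5 bp sequence grows to 8 bp.
assert pad_to_multiple_sketch("ACGTA", 4) == "ACGTAAAA"
# An aligned sequence is returned unchanged.
assert pad_to_multiple_sketch("ACGT", 4) == "ACGT"
```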
wrap_dna_for_tokenizer
Return <dna>...</dna> input with Carbon-compatible padding.
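The wrapping step might look like the sketch below, which pads with A and then adds the <dna> tags. The multiple argument is a hypothetical stand-in for Carbon's token multiple, and the input is assumed to be already canonical; wrap_dna_sketch is an illustrative name only.

```python
def wrap_dna_sketch(sequence: str, multiple: int) -> str:
    """Pad canonical DNA with trailing A bases, then wrap it in <dna> tags."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # (-len) % multiple is the number of bases needed to reach the next multiple.
    pad = -len(sequence) % multiple
    return f"<dna>{sequence}{'A' * pad}</dna>"


# Placeholder multiple of 4: "ACGTA" gains three trailing A bases before wrapping.
assert wrap_dna_sketch("ACGTA", 4) == "<dna>ACGTAAAA</dna>"
```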