geno_lewm.encoder
State-encoder input preparation and Carbon wrapper helpers.
The pure-Python windowing, pooling, and cache helpers import without the
ML runtime. CarbonStateEncoder loads the optional Transformers stack
only when callers construct it without injected model/tokenizer objects.
CacheReindexReport
dataclass
Summary of a SQLite index rebuild.
CacheRepairReport
dataclass
Summary of a repair pass over Parquet shards.
WindowCacheKey
dataclass
WindowCacheKey(window_hash: bytes, encoder_hash: bytes, state_layer: int, pool_type: str, pool_radius: int, dtype: str)
Content-addressed key for a cached embedding row.
WindowCacheRecord
dataclass
WindowCacheRecord(chrom: str, start_bp: int, end_bp: int, window_hash: bytes, encoder_hash: bytes, state_layer: int, pool_type: str, pool_radius: int, dtype: str, embedding: tuple[float, ...], untargeted: bool, created_at: int = 0, schema_version: str = CACHE_SCHEMA_VERSION)
One row in the window-embedding cache schema.
with_created_at
Fill created_at with the current UTC time in nanoseconds when it is unset.
Source code in geno_lewm/encoder/cache.py
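The timestamp-filling behaviour can be illustrated with a minimal sketch. RecordSketch and with_created_at_sketch are hypothetical stand-ins, not the library's classes; treating created_at == 0 as "absent" and using a frozen dataclass are assumptions based on the signature default above.

```python
import time
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordSketch:
    """Hypothetical two-field stand-in for WindowCacheRecord."""

    chrom: str
    created_at: int = 0  # assumption: 0 means "not yet stamped"


def with_created_at_sketch(record: RecordSketch) -> RecordSketch:
    """Return a copy stamped with the current time if created_at is unset."""
    if record.created_at:
        return record  # already stamped: leave untouched
    # time.time_ns() is nanoseconds since the Unix epoch (UTC-based).
    return replace(record, created_at=time.time_ns())


stamped = with_created_at_sketch(RecordSketch("chr1"))
kept = with_created_at_sketch(RecordSketch("chr1", created_at=5))
```

Returning a new object rather than mutating in place keeps the record usable as an immutable value, which seems consistent with the write-once shard semantics described for write_shard.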
CarbonStateEncoder
CarbonStateEncoder(model_id: str, revision: str, *, dtype: str = 'bf16', state_layer: int = -1, pool_type: str = POOL_CENTERED_MEAN, pool_radius: int = DEFAULT_POOL_RADIUS_TOKENS, normalize: bool = True, lora_config: object | None = None, model: object | None = None, tokenizer: object | None = None, encoder_hash: bytes | str | None = None, local_files_only: bool = True, trust_remote_code: bool = False, device: str | None = None)
Encode DNA windows with Carbon hidden states plus deterministic pooling.
Source code in geno_lewm/encoder/carbon.py
encode
encode_batch
encode_batch(windows: Sequence[str], edit_loci: Sequence[int | None]) -> tuple[tuple[float, ...], ...]
Encode and pool a batch of DNA windows.
Source code in geno_lewm/encoder/carbon.py
PoolingResult
dataclass
PoolingResult(vector: tuple[float, ...], pool_type: Literal['centered_mean', 'global_mean'], pool_radius: int, untargeted: bool, center_token: int | None, token_count: int)
Pooled state vector plus cache-key metadata.
as_cache_fields
Return fields shared with the window-cache schema.
ExtractedWindow
dataclass
ExtractedWindow(sequence: str, start_bp: int, end_bp: int, window_bp: int, edit_locus: int | None = None, relative_edit_locus: int | None = None, pad_right_bp: int = 0)
A fixed-size DNA window plus its source-coordinate metadata.
start_bp and end_bp are 0-based half-open coordinates in
the caller's source coordinate system. end_bp - start_bp always
equals window_bp even when the sequence had to be right-padded
past the available source bases; pad_right_bp records how many
trailing A bases were introduced.
as_tokenizer_input
default_cache_dir
read_embedding
Return an embedding by content key, or None on cache miss.
Source code in geno_lewm/encoder/cache.py
reindex_cache
Rebuild index.sqlite from every readable Parquet shard.
Source code in geno_lewm/encoder/cache.py
repair_cache
Quarantine unreadable Parquet shards and rebuild the SQLite index.
Source code in geno_lewm/encoder/cache.py
shard_path_for
shard_path_for(cache_dir: Path | str, *, encoder_id: str, state_layer: int, pool_type: str, pool_radius: int, contig: str, stride_block: int) -> Path
Return the canonical Parquet shard path for a cache block.
Source code in geno_lewm/encoder/cache.py
write_shard
write_shard(cache_dir: Path | str, *, encoder_id: str, contig: str, stride_block: int, records: Sequence[WindowCacheRecord]) -> Path
Write one immutable Parquet shard and index its rows.
If the shard already exists with the same rows, this is a no-op. If it exists and new or conflicting rows are supplied, the function raises instead of rewriting in place (INV-DATA-3 / INV-DATA-10).
Source code in geno_lewm/encoder/cache.py
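The write-once contract described above (no-op on identical rows, error on differing rows) can be sketched independently of Parquet. In this illustration JSON text stands in for the Parquet shard, the SQLite indexing step is omitted, and write_immutable_sketch is a hypothetical name; none of this is the library's actual implementation.

```python
import json
from pathlib import Path


def write_immutable_sketch(path: Path, rows: list[dict]) -> Path:
    """Write rows once; never rewrite an existing file in place."""
    payload = json.dumps(rows, sort_keys=True)
    if path.exists():
        if path.read_text() == payload:
            return path  # same rows already on disk: no-op
        # New or conflicting rows: refuse instead of overwriting.
        raise FileExistsError(f"{path} exists with different rows")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling, then rename, so readers never
    # observe a partially written shard.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(payload)
    tmp.replace(path)
    return path
```

Raising on conflict rather than merging keeps every shard reproducible from its inputs, which is presumably the intent behind the invariants cited above.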
centered_mean
centered_mean(hidden_states: Sequence[Sequence[float]], *, center_token: int, pool_radius: int = DEFAULT_POOL_RADIUS_TOKENS) -> tuple[float, ...]
Mean-pool the inclusive token span center_token ± pool_radius.
Source code in geno_lewm/encoder/pooling.py
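A pure-Python sketch of mean-pooling over the inclusive span center_token ± pool_radius follows. centered_mean_sketch is a hypothetical name, and clamping the span to the sequence bounds is an assumption; the library may instead reject spans that run off either end.

```python
from typing import Sequence


def centered_mean_sketch(
    hidden_states: Sequence[Sequence[float]],
    *,
    center_token: int,
    pool_radius: int,
) -> tuple[float, ...]:
    """Average token vectors in [center - radius, center + radius]."""
    if not 0 <= center_token < len(hidden_states):
        raise ValueError("center_token is outside hidden_states")
    if pool_radius < 0:
        raise ValueError("pool_radius must be non-negative")
    lo = max(0, center_token - pool_radius)  # assumption: clamp at left edge
    hi = min(len(hidden_states), center_token + pool_radius + 1)  # inclusive end
    span = hidden_states[lo:hi]
    # zip(*span) groups values by hidden dimension.
    return tuple(sum(column) / len(span) for column in zip(*span))


states = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [10.0, 10.0]]
interior = centered_mean_sketch(states, center_token=1, pool_radius=1)
at_edge = centered_mean_sketch(states, center_token=3, pool_radius=1)
```

With these inputs the interior call averages three tokens, while the call at the last token averages only the two that exist.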
global_mean
Mean-pool every token vector in hidden_states.
pool_hidden_states
pool_hidden_states(hidden_states: Sequence[Sequence[float]], *, edit_locus: int | None = None, pool_type: Literal['centered_mean', 'global_mean'] = POOL_CENTERED_MEAN, pool_radius: int = DEFAULT_POOL_RADIUS_TOKENS, token_bp: int = CARBON_TOKEN_BP) -> PoolingResult
Pool token-level hidden states into a state vector.
edit_locus is a 0-based base-pair offset within the encoder
window. When it is absent, RFC-0002 requires a global-mean fallback
tagged as untargeted=True so cache consumers do not mix arbitrary
reference-window embeddings with edit-local embeddings.
Source code in geno_lewm/encoder/pooling.py
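The targeted/untargeted dispatch can be sketched as below. This is an illustration under stated assumptions, not the library's code: pool_hidden_states_sketch is a hypothetical name, the base-pair offset is mapped to a token by integer division (edit_locus // token_bp), the span is clamped to the available tokens, and the defaults token_bp=2 and pool_radius=1 are placeholders rather than the values of CARBON_TOKEN_BP or DEFAULT_POOL_RADIUS_TOKENS.

```python
from typing import Optional, Sequence


def _mean(rows: Sequence[Sequence[float]]) -> tuple[float, ...]:
    """Column-wise mean of equally sized vectors."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def pool_hidden_states_sketch(
    hidden_states: Sequence[Sequence[float]],
    *,
    edit_locus: Optional[int] = None,
    pool_radius: int = 1,  # placeholder default
    token_bp: int = 2,     # placeholder default
) -> dict:
    if not hidden_states:
        raise ValueError("hidden_states must not be empty")
    if edit_locus is None:
        # No edit locus: global-mean fallback, tagged as untargeted.
        return {"vector": _mean(hidden_states), "pool_type": "global_mean",
                "untargeted": True, "center_token": None}
    if edit_locus < 0:
        raise ValueError("edit_locus must be non-negative")
    # Assumption: each token covers token_bp consecutive bases.
    center = min(edit_locus // token_bp, len(hidden_states) - 1)
    lo = max(0, center - pool_radius)
    hi = min(len(hidden_states), center + pool_radius + 1)
    return {"vector": _mean(hidden_states[lo:hi]), "pool_type": "centered_mean",
            "untargeted": False, "center_token": center}
```

Carrying untargeted alongside the vector lets a cache consumer filter on the flag instead of inferring the pooling mode from the embedding itself.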
canonicalize_dna
Return uppercase DNA after validating the supported alphabet.
The cache hash invariant is based on uppercased window content, so
callers can hash raw source slices and already-canonical windows
interchangeably. N is accepted because reference FASTA and
edited windows may contain masked bases.
Source code in geno_lewm/encoder/windowing.py
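A minimal sketch of uppercase-then-validate canonicalization is shown below. canonicalize_dna_sketch is a hypothetical name, and the exact alphabet (A, C, G, T, N) is an assumption; the text above confirms only that N is accepted.

```python
# Assumption: the supported alphabet is exactly these five symbols.
_ALLOWED_BASES = frozenset("ACGTN")


def canonicalize_dna_sketch(sequence: str) -> str:
    """Uppercase the sequence and reject characters outside the alphabet."""
    upper = sequence.upper()
    unsupported = set(upper) - _ALLOWED_BASES
    if unsupported:
        raise ValueError(f"unsupported DNA characters: {sorted(unsupported)}")
    return upper


canonical = canonicalize_dna_sketch("acgtNn")
```

Because the function is idempotent on already-canonical input, hashing a raw slice and hashing its canonical form give the same result once both pass through it.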
extract_window
extract_window(source_sequence: str, *, edit_locus: int | None = None, window_bp: int = DEFAULT_WINDOW_BP, assume_canonical: bool = False) -> ExtractedWindow
Extract a supported-width DNA window from source_sequence.
edit_locus is a 0-based offset in source_sequence. When it
is supplied the window is centered on that locus unless clamped by
source boundaries. When omitted, the source midpoint is used. If
the source is shorter than the requested window or the selected
interval extends past the right edge, trailing A bases are
appended per Carbon's tokenizer convention.
Set assume_canonical when source_sequence is already uppercase,
validated DNA (e.g. a contig from a loaded reference FASTA) to skip the
O(len) re-validation. Re-validating a whole chromosome once per variant
otherwise dominates VCF scoring wall-clock.
Source code in geno_lewm/encoder/windowing.py
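The centering and right-padding rules can be sketched as follows. extract_window_sketch is a hypothetical name and returns a plain dict rather than ExtractedWindow. The sketch assumes the window starts window_bp // 2 bases before the locus, is clamped only at the left boundary, and is padded with A on the right; the library's exact clamping policy may differ.

```python
from typing import Optional


def extract_window_sketch(
    source: str, *, edit_locus: Optional[int] = None, window_bp: int
) -> dict:
    if window_bp <= 0:
        raise ValueError("window_bp must be positive")
    if edit_locus is not None and not 0 <= edit_locus < len(source):
        raise ValueError("edit_locus is outside the source sequence")
    # Fall back to the source midpoint when no locus is given.
    center = len(source) // 2 if edit_locus is None else edit_locus
    start = max(0, center - window_bp // 2)  # assumption: clamp at left edge only
    end = start + window_bp                  # half-open; may exceed len(source)
    body = source[start:end]
    pad_right = window_bp - len(body)        # bases missing past the right edge
    return {
        "sequence": body + "A" * pad_right,
        "start_bp": start,
        "end_bp": end,
        "pad_right_bp": pad_right,
        "relative_edit_locus": None if edit_locus is None else edit_locus - start,
    }


inner = extract_window_sketch("ACGTACGTAC", edit_locus=5, window_bp=4)
right = extract_window_sketch("ACGTACGTAC", edit_locus=9, window_bp=4)
```

Here the interior window needs no padding, while the window at the last base runs one position past the source and receives a single trailing A; in both cases end_bp - start_bp equals window_bp.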
pad_for_carbon_tokenizer
Right-pad canonical DNA to Carbon's token multiple.
Source code in geno_lewm/encoder/windowing.py
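Right-padding to a token multiple can be sketched as below. pad_right_sketch is a hypothetical name, token_bp is passed explicitly because the value of Carbon's token width is not stated here, and the use of A as the pad base follows the extract_window description.

```python
def pad_right_sketch(sequence: str, token_bp: int) -> str:
    """Append 'A' bases until len(sequence) is a multiple of token_bp."""
    if token_bp <= 0:
        raise ValueError("token_bp must be positive")
    remainder = len(sequence) % token_bp
    if remainder == 0:
        return sequence  # already aligned: nothing to add
    return sequence + "A" * (token_bp - remainder)


padded = pad_right_sketch("ACGTAC", 4)
```

With a hypothetical token width of 4, the six-base input gains two trailing A bases, and an already aligned input is returned unchanged.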
window_sha256
Return SHA-256 bytes for the canonicalized DNA sequence.
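Hashing the canonical form can be sketched with the standard library. window_sha256_sketch is a hypothetical name; encoding the uppercased sequence as ASCII before hashing is an assumption, and alphabet validation is omitted for brevity.

```python
import hashlib


def window_sha256_sketch(sequence: str) -> bytes:
    """Return the 32-byte SHA-256 digest of the uppercased sequence."""
    # Assumption: the canonical byte form is the ASCII-encoded uppercase text.
    return hashlib.sha256(sequence.upper().encode("ascii")).digest()


lower_digest = window_sha256_sketch("acgtn")
upper_digest = window_sha256_sketch("ACGTN")
```

Both calls yield the same digest, which is the property that lets raw source slices and canonical windows share a cache key.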
wrap_dna_for_tokenizer
Return <dna>...</dna> input with Carbon-compatible padding.
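The wrapping step can be sketched as padding followed by tag insertion. wrap_dna_sketch is a hypothetical name, token_bp is an explicit parameter with no assumed default, and padding with A before wrapping is inferred from the descriptions above rather than confirmed.

```python
def wrap_dna_sketch(sequence: str, token_bp: int) -> str:
    """Uppercase, right-pad with 'A' to a token multiple, wrap in <dna> tags."""
    if token_bp <= 0:
        raise ValueError("token_bp must be positive")
    pad = -len(sequence) % token_bp  # bases needed to reach the next multiple
    return f"<dna>{sequence.upper()}{'A' * pad}</dna>"


wrapped = wrap_dna_sketch("acgta", 4)
```

The padding sits inside the tags, so the tokenizer sees a DNA payload whose length is always a whole number of tokens.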