geno_lewm.data.builder
Training tuple builder for the RFC-0006 data pipeline.
This module owns the dependency-free boundary between prepared data sources and the eventual PyTorch trainer. It does not download gnomAD, ClinVar, or Carbon data. Instead, callers provide edit-source providers that are easy to unit-test with fixtures and later wire to real shards.
DEFAULT_EDIT_SOURCE_COUNTS
module-attribute
DEFAULT_EDIT_SOURCE_COUNTS: tuple[EditSourceCount, ...] = (EditSourceCount(SOURCE_GNOMAD_COMMON, 3), EditSourceCount(SOURCE_SYNTHETIC_SNV, 3), EditSourceCount(SOURCE_SYNTHETIC_INDEL, 1), EditSourceCount(SOURCE_CLINVAR, 1))
RFC-0006 §3.3 per-window source allocation for N_edits = 8.
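As a quick sanity check on this allocation, the per-source counts add up to the N_edits = 8 budget. The sketch below uses stand-in definitions: the string values of the SOURCE_* constants and the field names `source` and `count` are assumptions for illustration, since only the positional construction is documented here.

```python
from dataclasses import dataclass

# Stand-in constants: the real string values are defined in geno_lewm (assumed here).
SOURCE_GNOMAD_COMMON = "gnomad_common"
SOURCE_SYNTHETIC_SNV = "synthetic_snv"
SOURCE_SYNTHETIC_INDEL = "synthetic_indel"
SOURCE_CLINVAR = "clinvar"


@dataclass(frozen=True)
class EditSourceCountSketch:
    """Stand-in for EditSourceCount; the field names are assumptions."""

    source: str
    count: int


default_mix = (
    EditSourceCountSketch(SOURCE_GNOMAD_COMMON, 3),
    EditSourceCountSketch(SOURCE_SYNTHETIC_SNV, 3),
    EditSourceCountSketch(SOURCE_SYNTHETIC_INDEL, 1),
    EditSourceCountSketch(SOURCE_CLINVAR, 1),
)

# Total number of edits requested per window under the 3/3/1/1 mix.
n_edits = sum(item.count for item in default_mix)
```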
DEFAULT_SOURCE_FALLBACKS
module-attribute
DEFAULT_SOURCE_FALLBACKS: dict[str, str] = {SOURCE_CLINVAR: SOURCE_SYNTHETIC_SNV, SOURCE_GNOMAD_COMMON: SOURCE_SYNTHETIC_SNV}
Default fallback when an absolute VCF edit is unavailable for a window.
ClinVar hard-negatives and gnomAD common variants are placed (absolute) sources: they only apply to windows that carry genome coordinates. On unplaced windows (the synthetic Carbon pretraining corpus) the absolute providers yield nothing and the builder draws synthetic SNVs instead, so pretraining-corpus windows still produce full edit tuples.
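The placed/unplaced distinction can be illustrated with a small lookup. This is a simplified sketch, not the builder's logic: the function name `effective_source`, the string source names, and the idea of deciding from `chrom` alone are assumptions (the real builder presumably falls back when a provider yields too few edits).

```python
from typing import Mapping, Optional

# Stand-in source names; the real values are defined in geno_lewm (assumed here).
FALLBACKS = {"clinvar": "synthetic_snv", "gnomad_common": "synthetic_snv"}
ABSOLUTE_SOURCES = frozenset(FALLBACKS)


def effective_source(
    source: str, chrom: Optional[str], fallbacks: Mapping[str, str]
) -> Optional[str]:
    """Return the source to draw from, or None if nothing is available."""
    if source in ABSOLUTE_SOURCES and chrom is None:
        # Unplaced window: an absolute provider has no coordinates to match,
        # so only an explicitly configured fallback can supply edits.
        return fallbacks.get(source)
    return source
```

With these stand-ins, an unplaced window asking for `"clinvar"` resolves to `"synthetic_snv"`, while a window on `"chr1"` keeps `"clinvar"`.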
EditSourceCount
dataclass
Number of edits to draw from one RFC-0006 source per window.
WindowContext
dataclass
WindowContext(record_id: str, source: str, sequence: str, start_bp: int = 0, chrom: str | None = None)
One reference window plus source coordinates for tuple building.
Coordinates are 0-based half-open: start_bp is inclusive and
end_bp is exclusive. chrom is required for absolute variant
providers and chromosome/interval holdouts, but synthetic providers
can operate on unplaced Carbon windows.
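The half-open convention can be sketched with a stand-in dataclass. The constructor shown above takes no `end_bp`, so the sketch assumes it is derived as `start_bp + len(sequence)`; that derivation, the class name `WindowSketch`, and the `is_placed` helper are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WindowSketch:
    """Stand-in for WindowContext with the same constructor fields."""

    record_id: str
    source: str
    sequence: str
    start_bp: int = 0
    chrom: Optional[str] = None

    @property
    def end_bp(self) -> int:
        # Assumption: the exclusive end follows from the sequence length.
        return self.start_bp + len(self.sequence)

    @property
    def is_placed(self) -> bool:
        # Hypothetical helper: placed windows carry a chromosome name.
        return self.chrom is not None


placed = WindowSketch("r1", "reference", "ACGT", start_bp=100, chrom="chr1")
unplaced = WindowSketch("r2", "carbon", "ACGTAC")
```

Here `placed` covers [100, 104) on chr1, and `unplaced` covers [0, 6) with no chromosome.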
HoldoutInterval
dataclass
0-based half-open genomic interval excluded from training.
intersects
Return whether [start_bp, end_bp) intersects this interval.
Source code in geno_lewm/data/builder.py
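The snippet below is not the code from builder.py. It is a minimal sketch of the standard overlap test for 0-based half-open intervals, using a stand-in class whose field names and method signature are assumptions (the real method may also compare chromosomes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalSketch:
    """Stand-in for HoldoutInterval; the field names are assumptions."""

    chrom: str
    start_bp: int
    end_bp: int

    def intersects(self, start_bp: int, end_bp: int) -> bool:
        # Two half-open intervals overlap iff each one starts before the
        # other ends. Intervals that merely touch do not overlap.
        return start_bp < self.end_bp and self.start_bp < end_bp


held_out = IntervalSketch("chr1", 100, 200)
```

Because the end is exclusive, a query of [200, 300) does not intersect [100, 200), but [199, 300) does.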
HoldoutPolicy
dataclass
HoldoutPolicy(holdout_chroms: tuple[str, ...] = (), intervals: tuple[HoldoutInterval, ...] = (), edit_keys: tuple[str, ...] = (), record_ids: tuple[str, ...] = ())
Holdout exclusions enforced before a tuple reaches the trainer.
excludes_window
Return whether the entire source window is in a holdout.
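A window-level check might combine the record, chromosome, and interval holdouts roughly as follows. This is a sketch under stated assumptions: the class name `PolicySketch`, the method signature, the plain-tuple interval representation, and the choice to exclude on any overlap (rather than full containment) are not confirmed by the documentation above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical interval representation: (chrom, start_bp, end_bp), half-open.
Interval = Tuple[str, int, int]


@dataclass(frozen=True)
class PolicySketch:
    """Stand-in for HoldoutPolicy covering only window-level fields."""

    holdout_chroms: Tuple[str, ...] = ()
    intervals: Tuple[Interval, ...] = ()
    record_ids: Tuple[str, ...] = ()

    def excludes_window(
        self, record_id: str, chrom: Optional[str], start_bp: int, end_bp: int
    ) -> bool:
        if record_id in self.record_ids:
            return True
        if chrom is None:
            # Unplaced windows cannot match coordinate-based holdouts.
            return False
        if chrom in self.holdout_chroms:
            return True
        # Assumption: any overlap with a held-out interval excludes the window.
        return any(
            c == chrom and start_bp < e and s < end_bp
            for c, s, e in self.intervals
        )


policy = PolicySketch(
    holdout_chroms=("chr8",),
    intervals=(("chr1", 1000, 2000),),
    record_ids=("carbon-0042",),
)
```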
excludes_edit
Return whether one relative edit intersects an edit-level holdout.
TrainingTuple
dataclass
TrainingTuple(window_id: str, source_record_id: str, edit_source: str, rel_edits: tuple[RelEdit, ...], target_window: str, window_start_bp: int, window_end_bp: int)
One RFC-0006 (window_id, action, target_window) training item.
TrainingDatasetItem
dataclass
One stream item with the source window needed for trainer encoding.
GenoLeWMDataset
GenoLeWMDataset(windows: _WindowSource, providers: Mapping[str, _EditProvider], *, seed: int, mix: Sequence[EditSourceCount] = DEFAULT_EDIT_SOURCE_COUNTS, holdouts: HoldoutPolicy | None = None, fallback_sources: Mapping[str, str] | None = DEFAULT_SOURCE_FALLBACKS, preserve_length: bool = True)
Bases: _load_iterable_dataset_base()
Deterministic iterable dataset over windows and edit-source providers.
The class subclasses torch.utils.data.IterableDataset when torch
is installed, but falls back to a plain Python iterable in core/dev
environments. That keeps the data contract testable without pulling
in the full training extra.
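The optional-base pattern described here can be sketched as follows. The helper name `load_iterable_base` and the toy `CountingDataset` are illustrative stand-ins, not the module's private `_load_iterable_dataset_base`; only `torch.utils.data.IterableDataset` is a real API.

```python
def load_iterable_base() -> type:
    """Return torch's IterableDataset when available, else plain object."""
    try:
        from torch.utils.data import IterableDataset
    except ImportError:
        # Core/dev environment without the training extra installed.
        return object
    return IterableDataset


class CountingDataset(load_iterable_base()):
    """Toy iterable that behaves the same with or without torch."""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


items = list(CountingDataset(3))
```

Because the class only relies on `__iter__`, the same unit tests can run in both environments, and a torch `DataLoader` can consume it when torch is present.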
__iter__
iter_with_source_windows
Yield tuples together with their source windows for trainer encoding.
build_training_tuples
build_training_tuples(window: WindowContext, providers: Mapping[str, _EditProvider], *, rng: Random, mix: Sequence[EditSourceCount] = DEFAULT_EDIT_SOURCE_COUNTS, holdouts: HoldoutPolicy | None = None, fallback_sources: Mapping[str, str] | None = DEFAULT_SOURCE_FALLBACKS, preserve_length: bool = True) -> tuple[TrainingTuple, ...]
Build per-window training tuples with source mix and holdout checks.
providers map source names to callables returning relative edits
for that window. The default mix encodes RFC-0006's 3/3/1/1
gnomAD/synthetic-SNV/synthetic-indel/ClinVar allocation. If a source
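cannot produce enough edits, only explicitly configured fallbacks are
used. With the default fallback_sources (DEFAULT_SOURCE_FALLBACKS),
gnomAD and ClinVar shortfalls are filled with synthetic SNVs; a source
without a configured fallback fails instead of silently turning the
training stream synthetic.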
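The allocation and fallback rule can be sketched in a few lines. Everything here is a simplified stand-in: the function `draw_mix`, the provider signature `(sequence, rng, count)`, the `(position, ref, alt)` edit tuples, and the use of `ValueError` are assumptions, not the signature or behavior of `build_training_tuples`.

```python
import random
from typing import Callable, List, Mapping, Optional, Sequence, Tuple

Edit = Tuple[int, str, str]  # hypothetical (relative position, ref, alt)
Provider = Callable[[str, random.Random, int], List[Edit]]  # hypothetical


def draw_mix(
    sequence: str,
    providers: Mapping[str, Provider],
    mix: Sequence[Tuple[str, int]],
    rng: random.Random,
    fallbacks: Optional[Mapping[str, str]] = None,
) -> List[Tuple[str, Edit]]:
    """Draw `count` edits per source, using only configured fallbacks."""
    fallbacks = fallbacks or {}
    drawn: List[Tuple[str, Edit]] = []
    for source, count in mix:
        provider = providers.get(source)
        edits = provider(sequence, rng, count)[:count] if provider else []
        used = [(source, edit) for edit in edits]
        missing = count - len(used)
        if missing > 0:
            fallback = fallbacks.get(source)
            if fallback is None or fallback not in providers:
                # No explicit fallback: fail rather than substitute silently.
                raise ValueError(f"{source!r} is {missing} edit(s) short")
            extra = providers[fallback](sequence, rng, missing)[:missing]
            if len(extra) < missing:
                raise ValueError(f"fallback {fallback!r} is also short")
            used.extend((fallback, edit) for edit in extra)
        drawn.extend(used)
    return drawn


def fixed_provider(sequence: str, rng: random.Random, count: int) -> List[Edit]:
    # Toy provider: placeholder edits at the first `count` positions.
    return [(i, sequence[i], "N") for i in range(min(count, len(sequence)))]


def empty_provider(sequence: str, rng: random.Random, count: int) -> List[Edit]:
    # Toy provider that never has data, like a VCF source on an unplaced window.
    return []


providers = {"gnomad_common": empty_provider, "synthetic_snv": fixed_provider}
mix = [("gnomad_common", 3), ("synthetic_snv", 3)]
drawn = draw_mix(
    "ACGTACGTAC", providers, mix, random.Random(7),
    fallbacks={"gnomad_common": "synthetic_snv"},
)
```

With the fallback configured, all six edits come from the synthetic SNV stand-in. Calling `draw_mix` with `fallbacks=None` on the same inputs raises `ValueError`, which mirrors the "fail instead of silently substituting" rule.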
synthetic_snv_provider
Provider for RFC-0006 uniform synthetic SNVs.
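One way to read "uniform synthetic SNVs" is: pick positions uniformly at random, then pick the alternate base uniformly among the three bases that differ from the reference. The sketch below implements that reading; the function name `sample_snvs`, its signature, and the `(position, ref, alt)` tuple shape are assumptions, and RFC-0006 may specify details not reproduced here.

```python
import random
from typing import List, Tuple

BASES = "ACGT"


def sample_snvs(
    sequence: str, rng: random.Random, count: int
) -> List[Tuple[int, str, str]]:
    """Sample up to `count` SNVs as (relative position, ref, alt) tuples."""
    # Only positions with an unambiguous reference base are eligible.
    positions = [i for i, base in enumerate(sequence) if base in BASES]
    chosen = rng.sample(positions, min(count, len(positions)))
    edits = []
    for pos in sorted(chosen):
        ref = sequence[pos]
        # Uniform choice among the three non-reference bases.
        alt = rng.choice([base for base in BASES if base != ref])
        edits.append((pos, ref, alt))
    return edits


snvs = sample_snvs("ACGTNACGT", random.Random(0), 3)
```

Passing a seeded `random.Random` keeps the draw reproducible, which matches the deterministic dataset contract described for GenoLeWMDataset.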
synthetic_indel_provider
Provider for RFC-0006 synthetic indels.
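A synthetic indel can be sketched as a short random insertion or deletion written in VCF style, where ref and alt share a leading anchor base. The names `sample_indel` and `apply_edit`, the 50/50 insertion/deletion split, and the `max_len` default are assumptions for illustration; the sketch does not model the `preserve_length` option.

```python
import random
from typing import Tuple

Edit = Tuple[int, str, str]  # hypothetical (relative position, ref, alt)


def sample_indel(sequence: str, rng: random.Random, max_len: int = 3) -> Edit:
    """Sample one insertion or deletion of 1..max_len bases."""
    if len(sequence) <= max_len:
        raise ValueError("sequence too short for the requested indel length")
    # Leave room so a deletion of max_len bases after the anchor fits.
    pos = rng.randrange(len(sequence) - max_len)
    length = rng.randint(1, max_len)
    anchor = sequence[pos]
    if rng.random() < 0.5:
        # Deletion: ref is the anchor plus the deleted bases, alt is the anchor.
        return (pos, sequence[pos : pos + 1 + length], anchor)
    # Insertion: ref is the anchor, alt is the anchor plus random bases.
    inserted = "".join(rng.choice("ACGT") for _ in range(length))
    return (pos, anchor, anchor + inserted)


def apply_edit(sequence: str, edit: Edit) -> str:
    """Apply one relative edit, checking the reference allele first."""
    pos, ref, alt = edit
    if sequence[pos : pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the window")
    return sequence[:pos] + alt + sequence[pos + len(ref) :]


window = "ACGTACGTACGTACGTACGT"
indel = sample_indel(window, random.Random(3))
edited = apply_edit(window, indel)
```

The edited window differs in length from the original by `len(alt) - len(ref)`, which is why a length-preserving option is useful for fixed-size model inputs.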
variant_provider
Return a provider backed by absolute VCF-style variants.
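The core job of such a provider is translating absolute variant coordinates into window-relative edits. The sketch below shows one way to do that; the factory name `make_variant_provider`, the provider signature, and the in-memory variant tuples are assumptions. It relies on the VCF convention that POS is 1-based, so positions are converted to the 0-based half-open window coordinates before matching.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, 1-based POS, ref, alt)
Edit = Tuple[int, str, str]  # hypothetical (relative position, ref, alt)


def make_variant_provider(
    variants: Sequence[Variant],
) -> Callable[[Optional[str], int, str], List[Edit]]:
    """Return a provider that maps absolute variants into one window."""

    def provider(chrom: Optional[str], start_bp: int, sequence: str) -> List[Edit]:
        if chrom is None:
            # Unplaced window: absolute variants cannot be matched.
            return []
        end_bp = start_bp + len(sequence)
        edits = []
        for v_chrom, pos_1based, ref, alt in variants:
            pos0 = pos_1based - 1  # VCF POS is 1-based
            if v_chrom != chrom:
                continue
            if pos0 < start_bp or pos0 + len(ref) > end_bp:
                continue  # ref allele not fully inside [start_bp, end_bp)
            rel = pos0 - start_bp
            if sequence[rel : rel + len(ref)] != ref:
                continue  # skip variants whose ref disagrees with the window
            edits.append((rel, ref, alt))
        return edits

    return provider


provider = make_variant_provider(
    [("chr1", 103, "G", "T"), ("chr2", 103, "G", "T"), ("chr1", 500, "A", "C")]
)
placed_edits = provider("chr1", 100, "ACGTACGT")
unplaced_edits = provider(None, 0, "ACGTACGT")
```

Only the first variant falls inside the chr1 window [100, 108), and it lands at relative position 2. The unplaced window yields no edits, which is the case the default fallbacks are meant to cover.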