geno_lewm.data
Data pipeline helpers for GenoLeWM.
DEFAULT_EDIT_SOURCE_COUNTS
module-attribute
DEFAULT_EDIT_SOURCE_COUNTS: tuple[EditSourceCount, ...] = (EditSourceCount(SOURCE_GNOMAD_COMMON, 3), EditSourceCount(SOURCE_SYNTHETIC_SNV, 3), EditSourceCount(SOURCE_SYNTHETIC_INDEL, 1), EditSourceCount(SOURCE_CLINVAR, 1))
RFC-0006 §3.3 per-window source allocation for N_edits = 8.
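As a quick check of the allocation above, the following minimal sketch rebuilds the default mix with stand-in types and confirms that the counts sum to N_edits = 8. The field names `source` and `count` and the string values of the `SOURCE_*` keys are assumptions made for illustration; only the 3/3/1/1 counts come from the definition above.

```python
from dataclasses import dataclass


# Stand-in for the documented EditSourceCount dataclass.
# The field names `source` and `count` are assumed, not confirmed.
@dataclass(frozen=True)
class EditSourceCount:
    source: str
    count: int


# Placeholder key values; the real SOURCE_* constants are not shown on this page.
SOURCE_GNOMAD_COMMON = "gnomad_common"
SOURCE_SYNTHETIC_SNV = "synthetic_snv"
SOURCE_SYNTHETIC_INDEL = "synthetic_indel"
SOURCE_CLINVAR = "clinvar"

MIX = (
    EditSourceCount(SOURCE_GNOMAD_COMMON, 3),
    EditSourceCount(SOURCE_SYNTHETIC_SNV, 3),
    EditSourceCount(SOURCE_SYNTHETIC_INDEL, 1),
    EditSourceCount(SOURCE_CLINVAR, 1),
)


def total_edits(mix):
    """Sum the per-source counts to get the number of edits per window."""
    return sum(item.count for item in mix)
```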
DEFAULT_SOURCE_FALLBACKS
module-attribute
DEFAULT_SOURCE_FALLBACKS: dict[str, str] = {SOURCE_CLINVAR: SOURCE_SYNTHETIC_SNV, SOURCE_GNOMAD_COMMON: SOURCE_SYNTHETIC_SNV}
Default fallback when an absolute VCF edit is unavailable for a window.
ClinVar hard-negatives and gnomAD common variants are placed (absolute) sources: they only apply to windows that carry genome coordinates. On unplaced windows (the synthetic Carbon pretraining corpus) the absolute providers yield nothing and the builder draws synthetic SNVs instead, so pretraining-corpus windows still produce full edit tuples.
EditSourceCount
dataclass
Number of edits to draw from one RFC-0006 source per window.
GenoLeWMDataset
GenoLeWMDataset(windows: _WindowSource, providers: Mapping[str, _EditProvider], *, seed: int, mix: Sequence[EditSourceCount] = DEFAULT_EDIT_SOURCE_COUNTS, holdouts: HoldoutPolicy | None = None, fallback_sources: Mapping[str, str] | None = DEFAULT_SOURCE_FALLBACKS, preserve_length: bool = True)
Bases: _load_iterable_dataset_base()
Deterministic iterable dataset over windows and edit-source providers.
The class subclasses torch.utils.data.IterableDataset when torch
is installed, but falls back to a plain Python iterable in core/dev
environments. That keeps the data contract testable without pulling
in the full training extra.
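The optional-base pattern described above can be sketched as follows. This is an illustration under assumptions, not the library's actual `_load_iterable_dataset_base`: the helper name `load_iterable_dataset_base` and the `TinyDataset` class are hypothetical.

```python
def load_iterable_dataset_base():
    """Return torch's IterableDataset when torch is importable, else `object`.

    Hypothetical helper illustrating the fallback; the real loader may differ.
    """
    try:
        from torch.utils.data import IterableDataset
    except ImportError:
        # Core/dev environment without the training extra: use a plain base.
        return object
    return IterableDataset


class TinyDataset(load_iterable_dataset_base()):
    """Minimal iterable that behaves the same with or without torch."""

    def __init__(self, items):
        self._items = tuple(items)

    def __iter__(self):
        return iter(self._items)
```

Because the class only relies on `__iter__`, tests can consume it with `list(...)` whether or not torch is installed.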
Source code in geno_lewm/data/builder.py
__iter__
iter_with_source_windows
Yield tuples together with their source windows for trainer encoding.
Source code in geno_lewm/data/builder.py
HoldoutInterval
dataclass
0-based half-open genomic interval excluded from training.
intersects
Return whether [start_bp, end_bp) intersects this interval.
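For reference, a half-open intersection test can be sketched as below. The `Interval` class, its field names, and the `intersects` signature are hypothetical stand-ins; the actual `HoldoutInterval` fields are not listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical 0-based half-open interval [start_bp, end_bp) on one chromosome."""

    chrom: str
    start_bp: int
    end_bp: int

    def intersects(self, chrom: str, start_bp: int, end_bp: int) -> bool:
        # Two half-open ranges overlap only if each starts before the other ends.
        return (
            chrom == self.chrom
            and start_bp < self.end_bp
            and self.start_bp < end_bp
        )
```

With half-open coordinates, ranges that merely touch do not intersect: `[100, 200)` and `[200, 300)` share no base.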
Source code in geno_lewm/data/builder.py
HoldoutPolicy
dataclass
HoldoutPolicy(holdout_chroms: tuple[str, ...] = (), intervals: tuple[HoldoutInterval, ...] = (), edit_keys: tuple[str, ...] = (), record_ids: tuple[str, ...] = ())
Holdout exclusions enforced before a tuple reaches the trainer.
excludes_window
Return whether the entire source window is in a holdout.
Source code in geno_lewm/data/builder.py
excludes_edit
Return whether one relative edit intersects an edit-level holdout.
Source code in geno_lewm/data/builder.py
TrainingDatasetItem
dataclass
One stream item with the source window needed for trainer encoding.
TrainingTuple
dataclass
TrainingTuple(window_id: str, source_record_id: str, edit_source: str, rel_edits: tuple[RelEdit, ...], target_window: str, window_start_bp: int, window_end_bp: int)
One RFC-0006 (window_id, action, target_window) training item.
WindowContext
dataclass
WindowContext(record_id: str, source: str, sequence: str, start_bp: int = 0, chrom: str | None = None)
One reference window plus source coordinates for tuple building.
Coordinates are 0-based half-open: start_bp is inclusive and
end_bp is exclusive. chrom is required for absolute variant
providers and chromosome/interval holdouts, but synthetic providers
can operate on unplaced Carbon windows.
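The coordinate convention above can be illustrated by mapping an absolute variant position into a window-relative offset. This sketch assumes that `end_bp` equals `start_bp` plus the sequence length and that variant positions are 1-based as in VCF; the function name `relative_offset` and its parameters are hypothetical.

```python
from typing import Optional


def relative_offset(
    window_chrom: Optional[str],
    window_start_bp: int,
    window_length: int,
    variant_chrom: str,
    variant_pos_1based: int,
) -> Optional[int]:
    """Return the 0-based offset of a variant inside the window, or None.

    Assumes end_bp = start_bp + window_length (exclusive) and 1-based
    variant positions; both are assumptions for illustration.
    """
    # Unplaced windows (chrom is None) cannot host absolute variants.
    if window_chrom is None or window_chrom != variant_chrom:
        return None
    end_bp = window_start_bp + window_length
    pos0 = variant_pos_1based - 1  # convert 1-based position to 0-based
    if window_start_bp <= pos0 < end_bp:
        return pos0 - window_start_bp
    return None
```

An unplaced window returns `None` for every variant, which is why only synthetic providers can produce edits for it.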
ClinvarPrepareReport
dataclass
ClinvarPrepareReport(output_path: Path, release: str, records_read: int, allele_records_seen: int, records_written: int, skipped_allele: int, size_bytes: int, already_exists: bool = False)
Summary emitted by geno-lewm-prepare-clinvar.
ClinvarVariant
dataclass
ClinvarVariant(chrom: str, pos: int, ref: str, alt: str, clinical_significance: str, review_status: str, gene_symbol: str | None, clinvar_id: int, schema_version: str = CLINVAR_SCHEMA_VERSION)
One normalized ClinVar row.
CarbonCorpusConfig
dataclass
CarbonCorpusConfig(dataset_id: str = DEFAULT_CARBON_DATASET_ID, dataset_config: str | None = None, revision: str | None = None, default_source: str | None = None, skip_invalid: bool = False, split: str = 'train', streaming: bool = True, subset_fraction: float = DEFAULT_PHASE1_SUBSET_FRACTION, subset_seed: int = 0, sequence_field: str = DEFAULT_SEQUENCE_FIELD, source_field: str = DEFAULT_SOURCE_FIELD, source_id_field: str = DEFAULT_SOURCE_ID_FIELD, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP)
Configuration for reading and windowing the Carbon pretraining corpus.
CarbonRecord
dataclass
Canonicalized source sequence record from the Carbon corpus.
CarbonSourceMix
dataclass
One source bucket in the RFC-0006 Carbon sub-mix.
CarbonWindow
dataclass
GnomadPrepareReport
dataclass
GnomadPrepareReport(output_path: Path, release: str, records_read: int, allele_records_seen: int, records_written: int, skipped_filter: int, skipped_af: int, skipped_allele: int, size_bytes: int, already_exists: bool = False)
Summary emitted by geno-lewm-prepare-gnomad.
GnomadVariant
dataclass
GnomadVariant(chrom: str, pos: int, ref: str, alt: str, af_global: float, af_afr: float | None, af_ami: float | None, af_amr: float | None, af_asj: float | None, af_eas: float | None, af_fin: float | None, af_nfe: float | None, af_oth: float | None, af_sas: float | None, filter: str, schema_version: str = GNOMAD_SCHEMA_VERSION)
One normalized common-variant row for the gnomAD shard.
build_training_tuples
build_training_tuples(window: WindowContext, providers: Mapping[str, _EditProvider], *, rng: Random, mix: Sequence[EditSourceCount] = DEFAULT_EDIT_SOURCE_COUNTS, holdouts: HoldoutPolicy | None = None, fallback_sources: Mapping[str, str] | None = DEFAULT_SOURCE_FALLBACKS, preserve_length: bool = True) -> tuple[TrainingTuple, ...]
Build per-window training tuples with source mix and holdout checks.
providers map source names to callables returning relative edits
for that window. The default mix encodes RFC-0006's 3/3/1/1
gnomAD/synthetic-SNV/synthetic-indel/ClinVar allocation. If a source
cannot produce enough edits, only explicitly configured fallbacks are
used; missing gnomAD data therefore fails instead of silently turning
the training stream synthetic.
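The "explicit fallbacks only" rule can be sketched with simplified stand-ins. In this illustration, providers are hypothetical callables taking `(count, rng)` and returning a list of edits, and the mix is a sequence of `(source, count)` pairs; the real providers receive the window and the real builder also applies holdout checks, which are omitted here.

```python
from random import Random


def draw_edits(mix, providers, *, rng, fallbacks=None):
    """Draw edits per source, using a fallback only when one is configured.

    Simplified, hypothetical model of the allocation step; it returns
    (producing_source, edit) pairs.
    """
    fallbacks = dict(fallbacks or {})
    drawn = []
    for source, count in mix:
        edits = list(providers[source](count, rng))[:count]
        drawn.extend((source, edit) for edit in edits)
        shortfall = count - len(edits)
        if shortfall == 0:
            continue
        fallback = fallbacks.get(source)
        if fallback is None:
            # No configured fallback: fail loudly rather than substitute silently.
            raise ValueError(
                f"{source!r} produced {len(edits)} of {count} edits; no fallback set"
            )
        extra = list(providers[fallback](shortfall, rng))[:shortfall]
        if len(extra) < shortfall:
            raise ValueError(f"fallback {fallback!r} could not cover the shortfall")
        drawn.extend((fallback, edit) for edit in extra)
    return drawn
```

Passing an empty `fallbacks` mapping turns every shortfall into an error, which makes missing data visible at build time.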
Source code in geno_lewm/data/builder.py
synthetic_indel_provider
Provider for RFC-0006 synthetic indels.
Source code in geno_lewm/data/builder.py
synthetic_snv_provider
Provider for RFC-0006 uniform synthetic SNVs.
Source code in geno_lewm/data/builder.py
variant_provider
Return a provider backed by absolute VCF-style variants.
Source code in geno_lewm/data/builder.py
iter_clinvar_shard
Yield normalized ClinVar rows from a Parquet shard.
Source code in geno_lewm/data/clinvar.py
iter_clinvar_vcf_variants
iter_clinvar_vcf_variants(input_vcf: str | Path, *, max_allele_len: int = 16) -> Iterator[ClinvarVariant]
Yield normalized ClinVar rows from a local VCF without writing a shard.
Source code in geno_lewm/data/clinvar.py
label_set
Return ClinVar rows usable for labelled eval, excluding VUS/OTHER.
Source code in geno_lewm/data/clinvar.py
prepare_clinvar_shard
prepare_clinvar_shard(input_vcf: str | Path, output_dir: str | Path, *, release: str, max_allele_len: int = 16, overwrite: bool = False) -> ClinvarPrepareReport
Normalize a local ClinVar VCF/VCF.gz into the release shard schema.
Source code in geno_lewm/data/clinvar.py
draw_source_counts
draw_source_counts(n: int, *, rng: Random, mix: Sequence[CarbonSourceMix] = CARBON_SUBMIX) -> dict[str, int]
Draw n source samples and return counts by normalized source key.
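One plausible way to draw n weighted source samples and tally them is sketched below. The bucket names and weights in `EXAMPLE_SUBMIX` are placeholders, not the real `CARBON_SUBMIX`, and the function name `draw_counts` is hypothetical.

```python
from collections import Counter
from random import Random

# Placeholder (key, weight) buckets; the real CARBON_SUBMIX is not shown here.
EXAMPLE_SUBMIX = (("bucket_a", 0.6), ("bucket_b", 0.3), ("bucket_c", 0.1))


def draw_counts(n, *, rng, mix=EXAMPLE_SUBMIX):
    """Draw n weighted samples and return a count for every source key."""
    if n < 0:
        raise ValueError("n must be non-negative")
    keys = [key for key, _ in mix]
    weights = [weight for _, weight in mix]
    tally = Counter(rng.choices(keys, weights=weights, k=n))
    # Include zero counts so callers always see every key.
    return {key: tally.get(key, 0) for key in keys}
```

Passing a seeded `Random` instance keeps the draw reproducible across runs.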
Source code in geno_lewm/data/corpus.py
iter_carbon_records
iter_carbon_records(rows: Iterable[Mapping[str, Any]], *, sequence_field: str = DEFAULT_SEQUENCE_FIELD, source_field: str = DEFAULT_SOURCE_FIELD, source_id_field: str = DEFAULT_SOURCE_ID_FIELD, subset_fraction: float = 1.0, subset_seed: int = 0, default_source: str | None = None, skip_invalid: bool = False) -> Iterator[CarbonRecord]
Yield canonical Carbon records from HF-style row mappings.
Single-source corpus configs (e.g. eukaryote_generator_10B_subset) do
not carry a per-row source_field; pass default_source to label every
record (it must still be a recognized source key). With skip_invalid,
rows whose sequence contains unsupported (non-ACGTN) bases are skipped
instead of raising, because corpus shards occasionally contain IUPAC
ambiguity codes.
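The skip-or-raise behavior for non-ACGTN bases can be sketched as follows. This is a simplified stand-in that yields plain strings instead of `CarbonRecord` objects; the name `iter_valid_sequences`, the `"sequence"` field name, and the upper-casing step are assumptions.

```python
import re

# Accept only the five supported symbols; IUPAC codes such as R or Y fail.
_SUPPORTED = re.compile(r"[ACGTN]*")


def iter_valid_sequences(rows, *, sequence_field="sequence", skip_invalid=False):
    """Yield upper-cased sequences, skipping or rejecting unsupported bases."""
    for index, row in enumerate(rows):
        sequence = str(row[sequence_field]).upper()
        if _SUPPORTED.fullmatch(sequence) is None:
            if skip_invalid:
                continue  # drop the row silently
            raise ValueError(f"row {index} contains unsupported bases")
        yield sequence
```

With `skip_invalid=False`, the first row carrying an ambiguity code stops iteration with an error, which is useful when validating a new shard.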
Source code in geno_lewm/data/corpus.py
iter_record_windows
iter_record_windows(record: CarbonRecord, *, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP, rng: Random | None = None) -> Iterator[CarbonWindow]
Yield canonical windows for one Carbon corpus record.
Source code in geno_lewm/data/corpus.py
iter_window_starts
iter_window_starts(sequence_length: int, *, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP, rng: Random | None = None) -> Iterator[int]
Yield RFC-0006 window starts respecting margin and stride constraints.
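The exact RFC-0006 rules are not reproduced on this page, so the sketch below shows only one plausible reading: windows start at least `margin_bp` from either end of the sequence and advance by a fixed `stride_bp`. The name `window_starts` is hypothetical, and the optional `rng` argument of the real function is left out.

```python
def window_starts(sequence_length, *, window_bp, margin_bp, stride_bp):
    """Yield 0-based window starts that keep margin_bp clear at both ends.

    Assumed semantics for illustration only; the real rules may differ.
    """
    if window_bp <= 0 or stride_bp <= 0 or margin_bp < 0:
        raise ValueError("window_bp and stride_bp must be > 0, margin_bp >= 0")
    # Last start such that the window plus trailing margin still fits.
    last_start = sequence_length - margin_bp - window_bp
    yield from range(margin_bp, last_start + 1, stride_bp)
```

A sequence shorter than `window_bp + 2 * margin_bp` yields no starts under this reading.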
Source code in geno_lewm/data/corpus.py
load_hf_carbon_records
Load Carbon corpus records through Hugging Face datasets lazily.
Source code in geno_lewm/data/corpus.py
normalize_source_label
Normalize a Carbon corpus source label to the RFC-0006 source key.
Source code in geno_lewm/data/corpus.py
sample_source
Sample one source key from the configured RFC-0006 sub-mix.
stable_subset_includes
Return whether record_id belongs to a deterministic corpus subset.
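A deterministic subset test is commonly built on a stable hash of the record ID. The sketch below uses SHA-256 for that purpose; the hashing scheme, the `subset_includes` name, and the parameter names are assumptions, and the library may use a different method.

```python
import hashlib


def subset_includes(record_id: str, *, fraction: float, seed: int = 0) -> bool:
    """Return True if record_id falls in a stable `fraction` of the corpus.

    Hypothetical scheme: hash "seed:record_id" and map it into [0, 1).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    # The first 8 bytes give an integer in [0, 2**64); scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction
```

A cryptographic digest is used here instead of the built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would not give a stable subset across runs.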
Source code in geno_lewm/data/corpus.py
iter_gnomad_shard
Yield normalized gnomAD rows from a Parquet shard.
Source code in geno_lewm/data/gnomad.py
iter_gnomad_vcf_variants
iter_gnomad_vcf_variants(input_vcf: str | Path, *, min_af: float = 0.01, max_allele_len: int = 16) -> Iterator[GnomadVariant]
Yield normalized rows from a local gnomAD VCF without writing a shard.
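The `min_af` and `max_allele_len` parameters suggest a per-allele filter along the following lines. This is a guess at the semantics, not the library's implementation: it assumes the frequency threshold is inclusive and that the length limit applies to both the reference and the alternate allele. The name `keep_common_variant` is hypothetical.

```python
from typing import Optional


def keep_common_variant(
    ref: str,
    alt: str,
    af: Optional[float],
    *,
    min_af: float = 0.01,
    max_allele_len: int = 16,
) -> bool:
    """Return True if an allele passes the assumed frequency and length filters."""
    # Rows without a usable allele frequency cannot be called "common".
    if af is None or af < min_af:
        return False
    # Assumed: the limit applies to whichever allele is longer.
    return max(len(ref), len(alt)) <= max_allele_len
```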
Source code in geno_lewm/data/gnomad.py
prepare_gnomad_shard
prepare_gnomad_shard(input_vcf: str | Path, output_dir: str | Path, *, release: str = 'v4.1', min_af: float = 0.01, max_allele_len: int = 16, overwrite: bool = False) -> GnomadPrepareReport
Filter a local gnomAD VCF/VCF.gz into the release shard schema.
Source code in geno_lewm/data/gnomad.py