geno_lewm.data.corpus
Carbon pretraining corpus records and RFC-0006 window sampling.
CarbonSourceMix
dataclass
One source bucket in the RFC-0006 Carbon sub-mix.
CarbonCorpusConfig
dataclass
CarbonCorpusConfig(dataset_id: str = DEFAULT_CARBON_DATASET_ID, dataset_config: str | None = None, revision: str | None = None, default_source: str | None = None, skip_invalid: bool = False, split: str = 'train', streaming: bool = True, subset_fraction: float = DEFAULT_PHASE1_SUBSET_FRACTION, subset_seed: int = 0, sequence_field: str = DEFAULT_SEQUENCE_FIELD, source_field: str = DEFAULT_SOURCE_FIELD, source_id_field: str = DEFAULT_SOURCE_ID_FIELD, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP)
Configuration for reading and windowing the Carbon pretraining corpus.
CarbonRecord
dataclass
Canonicalized source sequence record from the Carbon corpus.
CarbonWindow
dataclass
normalize_source_label
Normalize a Carbon corpus source label to the RFC-0006 source key.
Source code in geno_lewm/data/corpus.py
sample_source
Sample one source key from the configured RFC-0006 sub-mix.
draw_source_counts
draw_source_counts(n: int, *, rng: Random, mix: Sequence[CarbonSourceMix] = CARBON_SUBMIX) -> dict[str, int]
Draw n source samples and return counts by normalized source key.
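The counting step can be illustrated with a minimal sketch. It assumes each bucket carries a source key and a sampling weight; `SourceBucket`, its field names, `draw_counts_sketch`, and the example keys are hypothetical stand-ins, not the actual `CarbonSourceMix` layout or the library's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from random import Random
from typing import Sequence


@dataclass(frozen=True)
class SourceBucket:
    """Hypothetical stand-in for CarbonSourceMix: a source key plus a weight."""

    source: str
    weight: float


def draw_counts_sketch(
    n: int, *, rng: Random, mix: Sequence[SourceBucket]
) -> dict[str, int]:
    """Draw n source keys in proportion to their weights and tally them."""
    if n < 0:
        raise ValueError("n must be non-negative")
    keys = [bucket.source for bucket in mix]
    weights = [bucket.weight for bucket in mix]
    # Start every key at zero so sources that were never drawn still appear.
    counts = Counter({key: 0 for key in keys})
    if n:
        counts.update(rng.choices(keys, weights=weights, k=n))
    return dict(counts)


# Illustrative keys and weights only; the real sub-mix is defined by RFC-0006.
example_mix = [SourceBucket("source_a", 0.75), SourceBucket("source_b", 0.25)]
example_counts = draw_counts_sketch(1000, rng=Random(0), mix=example_mix)
```

Passing an explicitly seeded `Random` keeps the draw reproducible, which is presumably why the documented signature takes `rng` as a required keyword argument.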
stable_subset_includes
Return whether record_id belongs to a deterministic corpus subset.
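One common way to get a deterministic subset is to hash the record ID together with a seed and compare the result against the requested fraction. The sketch below shows that idea only; the function name, parameter names, and the choice of SHA-256 are assumptions, and the actual `stable_subset_includes` may use a different scheme.

```python
import hashlib


def subset_includes_sketch(record_id: str, *, fraction: float, seed: int = 0) -> bool:
    """Return True if record_id falls inside a seed-dependent fixed subset."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # Hash seed and ID together so different seeds select different subsets.
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    # Compare against the scaled fraction; 1.0 admits every record, 0.0 none.
    return value < fraction * 2**64
```

Because membership depends only on the ID, the seed, and the fraction, the same records are selected on every pass over the corpus, and a record included at a small fraction stays included when the fraction grows.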
iter_window_starts
iter_window_starts(sequence_length: int, *, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP, rng: Random | None = None) -> Iterator[int]
Yield RFC-0006 window starts respecting margin and stride constraints.
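As a rough illustration of the margin and stride constraints, the sketch below assumes that `margin_bp` bases are excluded at both ends of the sequence and that starts advance by `stride_bp`. These semantics are an assumption rather than a reading of RFC-0006, the helper name is hypothetical, and the optional `rng` argument is not modeled.

```python
from typing import Iterator


def window_starts_sketch(
    sequence_length: int, *, window_bp: int, margin_bp: int, stride_bp: int
) -> Iterator[int]:
    """Yield start offsets of windows that fit between the two margins."""
    if window_bp <= 0 or stride_bp <= 0 or margin_bp < 0:
        raise ValueError("window_bp and stride_bp must be positive, margin_bp >= 0")
    # Last start whose window still ends before the trailing margin.
    last_start = sequence_length - margin_bp - window_bp
    # The range is empty when the sequence is too short for a single window.
    yield from range(margin_bp, last_start + 1, stride_bp)


# A 100 bp sequence with 20 bp windows, 10 bp margins, and a 25 bp stride.
starts = list(window_starts_sketch(100, window_bp=20, margin_bp=10, stride_bp=25))
# starts == [10, 35, 60]
```

Under these assumptions a sequence shorter than `window_bp + 2 * margin_bp` yields no windows at all.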
iter_record_windows
iter_record_windows(record: CarbonRecord, *, window_bp: int = DEFAULT_WINDOW_BP, margin_bp: int = DEFAULT_CORPUS_MARGIN_BP, stride_bp: int = DEFAULT_CORPUS_STRIDE_BP, rng: Random | None = None) -> Iterator[CarbonWindow]
Yield canonical windows for one Carbon corpus record.
iter_carbon_records
iter_carbon_records(rows: Iterable[Mapping[str, Any]], *, sequence_field: str = DEFAULT_SEQUENCE_FIELD, source_field: str = DEFAULT_SOURCE_FIELD, source_id_field: str = DEFAULT_SOURCE_ID_FIELD, subset_fraction: float = 1.0, subset_seed: int = 0, default_source: str | None = None, skip_invalid: bool = False) -> Iterator[CarbonRecord]
Yield canonical Carbon records from HF-style row mappings.
Single-source corpus configs (e.g. eukaryote_generator_10B_subset) do
not carry a per-row source_field; pass default_source to label every
record (it must still be a recognized source key). With skip_invalid set,
rows whose sequence contains unsupported (non-ACGTN) bases are skipped
instead of raising an error, because corpus shards occasionally contain
IUPAC ambiguity codes.
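The skip-or-raise behavior for unsupported bases can be sketched as follows. This is an illustration under assumptions, not the library's code: the helper name, the `"sequence"` field name, the uppercasing step, and the use of `ValueError` are all hypothetical.

```python
from typing import Any, Iterable, Iterator, Mapping

VALID_BASES = frozenset("ACGTN")


def iter_valid_sequences_sketch(
    rows: Iterable[Mapping[str, Any]],
    *,
    sequence_field: str = "sequence",  # hypothetical field name
    skip_invalid: bool = False,
) -> Iterator[str]:
    """Yield uppercased sequences, skipping or rejecting non-ACGTN rows."""
    for row in rows:
        sequence = str(row[sequence_field]).upper()  # assumed canonical form
        unsupported = set(sequence) - VALID_BASES
        if unsupported:
            if skip_invalid:
                continue  # drop the row, e.g. one with IUPAC codes such as R or Y
            raise ValueError(f"unsupported bases: {sorted(unsupported)}")
        yield sequence


rows = [{"sequence": "ACGT"}, {"sequence": "ACRYT"}, {"sequence": "nnac"}]
kept = list(iter_valid_sequences_sketch(rows, skip_invalid=True))
# kept == ["ACGT", "NNAC"]
```

With `skip_invalid` left at its default of `False`, the second row in this example would raise instead of being dropped.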
load_hf_carbon_records
Lazily load Carbon corpus records through Hugging Face datasets.