03 — Data model

Status: Authoritative for v0.1
Companion RFCs: RFC-0002, RFC-0003, RFC-0006, RFC-0009, RFC-0011

Every persistent type, schema, and on-disk format that crosses a process boundary is specified here. Schema versions are part of every artifact; changes follow the policy in 09-release-and-versioning.md.

In-memory types

`EditSpec`

@dataclass(frozen=True, slots=True)
class EditSpec:
    chrom: str           # non-empty, e.g., "chr17" or "17"
    pos: int             # 1-based, VCF convention; >= 1
    ref: str             # uppercase ACGT only, len in [1, V1_MAX_LEN]
    alt: str             # uppercase ACGT only, len in [1, V1_MAX_LEN]
    edit_type: EditType  # derived from (len(ref), len(alt))

Constants:

V1_MAX_LEN = 16 (bp). Edits whose ref or alt is longer than this raise UnsupportedEditError with an edit_type == SV payload.

Invariants:

ref != alt (else InvalidEditError).
set(ref) ⊆ {A, C, G, T}, same for alt (else InvalidEditError).
chrom is the contig name as it appears in the reference FASTA.
pos is 1-based, matching VCF semantics.

`RelEdit`

@dataclass(frozen=True, slots=True)
class RelEdit:
    rel_pos: int        # 0-based offset within window, in bp
    edit_type: EditType
    ref_bases: str
    alt_bases: str

Invariants:

0 <= rel_pos < window_length_bp.
window[rel_pos : rel_pos + len(ref_bases)] == ref_bases.upper() at apply time (else WindowMismatchError).
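To illustrate the apply-time check, here is a small sketch of applying a relative edit to a window string. The function name `apply_rel_edit` is hypothetical, and it assumes `window_length_bp` equals `len(window)` and that the window is already uppercased.

```python
class WindowMismatchError(ValueError):
    """Raised when the window does not contain ref_bases at rel_pos."""


def apply_rel_edit(window: str, rel_pos: int, ref_bases: str, alt_bases: str) -> str:
    """Return the edited window, checking both RelEdit invariants first."""
    # Invariant 1: 0 <= rel_pos < window_length_bp (assumed to be len(window)).
    if not 0 <= rel_pos < len(window):
        raise IndexError(f"rel_pos {rel_pos} outside window of length {len(window)}")
    end = rel_pos + len(ref_bases)
    # Invariant 2: the window must carry the reference allele at the edit site.
    if window[rel_pos:end] != ref_bases.upper():
        raise WindowMismatchError(
            f"expected {ref_bases.upper()!r} at {rel_pos}, found {window[rel_pos:end]!r}"
        )
    return window[:rel_pos] + alt_bases + window[end:]
```

For example, `apply_rel_edit("ACGTAC", 2, "G", "T")` replaces the single base at offset 2, while a `ref_bases` of `"T"` at the same offset raises `WindowMismatchError`.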

`TrainingTuple`

@dataclass
class TrainingTuple:
    window_id: str                   # SHA-256 (hex) of the reference window
    rel_edit: RelEdit                # the action
    target_window: str               # the edited window string
    edit_source: Literal["gnomad", "synthetic_snv",
                         "synthetic_indel", "clinvar"]

Used internally by the data pipeline. Tuples are not persisted individually; they are emitted in batches via IterableDataset.

`SurpriseResult`

See 02-public-api.md for the field list; consumers treat it as immutable.

On-disk: window-embedding cache

Format: Parquet shards.
Path: ${GENO_LEWM_CACHE}/embeddings/{encoder_id}/{state_layer}/{pool_type}_{pool_radius}/chr{contig}_{stride_block}.parquet
Compression: Zstandard, level 9.
One row per cached window.

Schema (Parquet)

column	type	nullable	description
`chrom`	string	no	chromosome / contig
`start_bp`	int64	no	inclusive
`end_bp`	int64	no	exclusive (`end_bp - start_bp == window_bp`)
`window_hash`	binary(32)	no	SHA-256 of the uppercased ACGT string
`encoder_hash`	binary(32)	no	SHA-256 of the encoder weights file
`state_layer`	int8	no	layer index used
`pool_type`	string	no	one of `centered_mean`, `global_mean`, `attention`
`pool_radius`	int32	no	pool radius in tokens
`dtype`	string	no	one of `bf16`, `fp16`, `fp32`
`embedding`	list	no	the state vector; length == d_state
`untargeted`	bool	no	true iff no edit locus was specified
`created_at`	int64	no	UTC unix nanoseconds
`schema_version`	string	no	always `1.0.0` for v0.1

The cache is content-addressed by (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype). Changing any of these fields invalidates the cached entry; the cache loader treats absence under a new key as a cache miss, not an error.

The cache writer never overwrites existing rows; a write that would duplicate a key is a no-op (post-hash equality check).
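The content-addressing and no-overwrite rules can be modeled in a few lines. The sketch below is an in-memory stand-in, not the Parquet-backed cache: `CacheKey`, `window_hash`, and `InMemoryCache` are hypothetical names, and the ASCII encoding of the window string is an assumption.

```python
import hashlib
from typing import Dict, NamedTuple, Optional, Sequence, Tuple


class CacheKey(NamedTuple):
    window_hash: bytes
    encoder_hash: bytes
    state_layer: int
    pool_type: str
    pool_radius: int
    dtype: str


def window_hash(window: str) -> bytes:
    """SHA-256 of the uppercased window string (32 raw bytes)."""
    return hashlib.sha256(window.upper().encode("ascii")).digest()


class InMemoryCache:
    """Toy cache: a missing key is a miss, and writes never overwrite."""

    def __init__(self) -> None:
        self._rows: Dict[CacheKey, Tuple[float, ...]] = {}

    def get(self, key: CacheKey) -> Optional[Tuple[float, ...]]:
        # Absence under a key is a cache miss, not an error.
        return self._rows.get(key)

    def put(self, key: CacheKey, embedding: Sequence[float]) -> bool:
        # A write that would duplicate a key is a no-op.
        if key in self._rows:
            return False
        self._rows[key] = tuple(embedding)
        return True
```

Because every field is part of the key, changing `dtype` from `bf16` to `fp32` simply looks up a different key and misses; nothing needs to be deleted to invalidate the old entry.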

Cache index (SQLite)

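A companion SQLite database ${GENO_LEWM_CACHE}/embeddings/index.sqlite maps each cache key (window_hash as 64-character hex, together with encoder_hash, state_layer, pool_type, pool_radius, and dtype) to its Parquet shard path and row offset.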

CREATE TABLE window_index (
    window_hash TEXT NOT NULL,
    encoder_hash TEXT NOT NULL,
    state_layer INTEGER NOT NULL,
    pool_type TEXT NOT NULL,
    pool_radius INTEGER NOT NULL,
    dtype TEXT NOT NULL,
    shard_path TEXT NOT NULL,
    row_offset INTEGER NOT NULL,
    created_at INTEGER NOT NULL,
    PRIMARY KEY (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype)
);
CREATE INDEX idx_shard_path ON window_index(shard_path);

The SQLite file is rebuildable from the Parquet shards at any time via geno-lewm-cache-windows --reindex.
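As an illustration of how the index might be used, the sketch below creates the table in an in-memory database with the standard-library `sqlite3` module and performs an insert and a lookup. The helper names (`open_index`, `add_entry`, `lookup`) and the use of `INSERT OR IGNORE` to mirror the cache's no-overwrite rule are assumptions, as are the shard file names in the usage note.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE window_index (
    window_hash TEXT NOT NULL,
    encoder_hash TEXT NOT NULL,
    state_layer INTEGER NOT NULL,
    pool_type TEXT NOT NULL,
    pool_radius INTEGER NOT NULL,
    dtype TEXT NOT NULL,
    shard_path TEXT NOT NULL,
    row_offset INTEGER NOT NULL,
    created_at INTEGER NOT NULL,
    PRIMARY KEY (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype)
);
CREATE INDEX idx_shard_path ON window_index(shard_path);
"""

Key = Tuple[str, str, int, str, int, str]


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open a fresh index database and create the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn: sqlite3.Connection, key: Key, shard_path: str,
              row_offset: int, created_at: int) -> bool:
    """Insert a row; return False if the key already exists (no overwrite)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO window_index VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (*key, shard_path, row_offset, created_at),
    )
    conn.commit()
    return cur.rowcount == 1


def lookup(conn: sqlite3.Connection, key: Key) -> Optional[Tuple[str, int]]:
    """Return (shard_path, row_offset) for a key, or None on a miss."""
    return conn.execute(
        "SELECT shard_path, row_offset FROM window_index "
        "WHERE window_hash = ? AND encoder_hash = ? AND state_layer = ? "
        "AND pool_type = ? AND pool_radius = ? AND dtype = ?",
        key,
    ).fetchone()
```

A second `add_entry` call with the same key but a different shard path (say `shard_b.parquet` after `shard_a.parquet`) returns `False` and leaves the original row in place.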

On-disk: gnomAD shard

Path: ${GENO_LEWM_DATA}/gnomad/{release}/variants.parquet
Default release: v4.1.
Builder: geno-lewm-prepare-gnomad --input-vcf <local.vcf[.gz]> --output ${GENO_LEWM_DATA}.

column	type	description
`chrom`	string	contig
`pos`	int64	1-based
`ref`	string	uppercase ACGT
`alt`	string	uppercase ACGT
`af_global`	float32	global allele frequency
`af_afr`	float32	African
`af_ami`	float32	Amish
`af_amr`	float32	Admixed American
`af_asj`	float32	Ashkenazi Jewish
`af_eas`	float32	East Asian
`af_fin`	float32	Finnish
`af_nfe`	float32	Non-Finnish European
`af_oth`	float32	Other
`af_sas`	float32	South Asian
`filter`	string	gnomAD VCF FILTER field
`schema_version`	string	always `1.0.0` for v0.1

Only variants with af_global >= 0.01 and filter == 'PASS' are included.
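The inclusion rule can be expressed as a simple row filter. This sketch works on plain dictionaries rather than Parquet rows, and `filter_included` and `MIN_AF_GLOBAL` are hypothetical names, not the builder's own.

```python
from typing import Any, Iterable, Iterator, Mapping

MIN_AF_GLOBAL = 0.01  # threshold stated in this section


def filter_included(rows: Iterable[Mapping[str, Any]]) -> Iterator[Mapping[str, Any]]:
    """Yield only rows that pass the FILTER check and the frequency floor."""
    for row in rows:
        if row["filter"] == "PASS" and row["af_global"] >= MIN_AF_GLOBAL:
            yield row
```

One caveat worth checking in a real pipeline: 0.01 is not exactly representable in float32, so a stored value of 0.01 widens to slightly less than the Python float 0.01 and would fail this comparison unless the threshold is cast to float32 as well.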

On-disk: ClinVar shard

Path: ${GENO_LEWM_DATA}/clinvar/{release}/variants.parquet
Release is a date string like 2026-04-15 matching NCBI's release.
Builder: geno-lewm-prepare-clinvar --input-vcf <local.vcf[.gz]> --release {release} --output ${GENO_LEWM_DATA}.

column	type	description
`chrom`	string	contig
`pos`	int64	1-based
`ref`	string	uppercase ACGT
`alt`	string	uppercase ACGT
`clinical_significance`	string	enum: `P`, `LP`, `LB`, `B`, `VUS`, `OTHER`
`review_status`	string	ClinVar review status
`gene_symbol`	string	nullable
`clinvar_id`	int64	ClinVar variation ID
`schema_version`	string	always `1.0.0` for v0.1

VUS is included for completeness but excluded from eval label sets.

On-disk: dataset release package

The first paper/demo dataset snapshot is published as one directory:

geno-lewm-data-v0.1.0-r1/
├── data_card.md
├── dataset_package.json
├── dataset_manifest.json
├── split_integrity.json
├── SHA256SUMS
├── carbon/
│   └── ...
└── clinvar/
    └── ...

python -m tools.release.dataset_package --dataset-dir DIR --metadata-json dataset_package.json generates the normalized metadata file, card, manifest, split-integrity report, and checksum file from already-built shards. The metadata JSON records snapshot_id, upstream source revisions, preprocessing commands, split policy, leakage checks, intended use, limitations, and the relative paths of included files. tools.release.paper_package rejects a dataset package when data_card.md or dataset_manifest.json no longer matches dataset_package.json.

The generated dataset_manifest.json is the machine-readable source of truth for release packaging. It contains:

schema_version, fixed at 1.0.0;
snapshot_id, for example geno-lewm-data-v0.1.0-r1;
sources, including upstream names and pinned revisions;
splits, including per-split record counts;
files, each with a relative path, sha256:<hex> digest, size_bytes, and optional split/description metadata.
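A `files` entry of the kind listed above might be assembled like this. The sketch is illustrative only: the key names `path`, `sha256`, and `split` are assumptions (only `size_bytes` and the `sha256:<hex>` digest form are stated here), and the example path is made up.

```python
import hashlib
from typing import Any, Dict, Optional


def file_entry(rel_path: str, data: bytes, split: Optional[str] = None) -> Dict[str, Any]:
    """Build one manifest file record from a relative path and file contents."""
    entry: Dict[str, Any] = {
        "path": rel_path,  # relative to the package root
        "sha256": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }
    if split is not None:  # split metadata is optional
        entry["split"] = split
    return entry
```

Calling `file_entry("clinvar/example.parquet", payload, split="eval")` yields a record whose digest and size a verifier can recompute from the file alone.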

split_integrity.json is generated from the manifest. It recomputes file hashes/sizes, observes row counts for JSONL, VCF, text, CSV/TSV, and Parquet files, records observed label/class balance from clinical_significance, label, or VCF CLNSIG fields, and extracts comparable keys from JSONL locus_key, variant_key, variant-coordinate, or record_id rows, from VCF alternate rows, and from Parquet rows that expose chrom, pos, ref, and alt columns. It fails when no train/eval comparable-key comparison can be made. It also records generated_by: tools.release.dataset_integrity; the paper/demo verifier rejects reports with a missing or mismatched source header. The generated data_card.md renders the same split-level class balance from split_integrity.json.
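The comparable-key comparison can be pictured as a set intersection over variant coordinates. This is a rough sketch under stated assumptions: it handles only rows that expose chrom, pos, ref, and alt, the function names are hypothetical, and what the real tool does with any overlap it finds is not specified here.

```python
from typing import Any, Iterable, Mapping, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def comparable_key(row: Mapping[str, Any]) -> VariantKey:
    """Key a row by its variant coordinates."""
    return (row["chrom"], row["pos"], row["ref"], row["alt"])


def split_overlap(train_rows: Iterable[Mapping[str, Any]],
                  eval_rows: Iterable[Mapping[str, Any]]) -> Set[VariantKey]:
    """Return keys present in both splits; fail if no comparison is possible."""
    train_keys = {comparable_key(r) for r in train_rows}
    eval_keys = {comparable_key(r) for r in eval_rows}
    if not train_keys or not eval_keys:
        raise ValueError("no train/eval comparable-key comparison can be made")
    return train_keys & eval_keys
```

An empty result means no variant appears in both splits; a non-empty result lists the shared keys for a leakage report.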

On-disk: calibration table

Path inside the checkpoint: calibration.parquet
Built by: geno-lewm-cache-windows --build-calibration
Schema (RFC-0009 §3.4):

column	type	description
`bucket_id`	string	`{region_class}|{gc_bin}|{repeat_class}`
`n_calibration`	int64	number of gnomAD variants in this bucket
`cdf`	list	1001 points: F(σ_raw) at σ-grid quantiles
`sigma_grid`	list	the σ_raw grid the CDF is evaluated on
`back_off_to`	string	parent bucket id if this bucket is sparse; nullable
`schema_version`	string	always `1.0.0` for v0.1

Bucket IDs are ASCII pipe-joined labels. Full buckets use {region_class}|{gc_bin}|{repeat_class}. Parent buckets omit the rightmost factors ({region_class}|{gc_bin}, then {region_class}), and the final catch-all bucket is *.

The builder consumes pre-scored reference rows (bucket_id, sigma_raw) and writes full, parent, and catch-all bucket CDFs. confidence and low_confidence are derived at scoring time from the selected bucket's n_calibration; they are not stored as separate Parquet columns.
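The bucket-ID convention and back-off order can be sketched as below. The function `backoff_chain` here is an illustrative reimplementation, not the code in `surprise/context.py`; `resolve_bucket` is a hypothetical helper that follows `back_off_to` pointers, and the labels in the usage note are made up.

```python
from typing import Dict, List, Optional


def backoff_chain(bucket_id: str) -> List[str]:
    """List a bucket and its parents, ending with the catch-all '*'."""
    parts = bucket_id.split("|")
    # Drop the rightmost factor one at a time: a|b|c -> a|b -> a.
    chain = ["|".join(parts[:n]) for n in range(len(parts), 0, -1)]
    if chain[-1] != "*":
        chain.append("*")
    return chain


def resolve_bucket(back_off_to: Dict[str, Optional[str]], bucket_id: str) -> str:
    """Follow back_off_to pointers until reaching a bucket that is not sparse."""
    seen = set()
    while back_off_to.get(bucket_id) is not None:
        if bucket_id in seen:
            raise ValueError(f"back-off cycle at {bucket_id!r}")
        seen.add(bucket_id)
        bucket_id = back_off_to[bucket_id]
    return bucket_id
```

With made-up labels, `backoff_chain("exon|gc_3|none")` walks from the full bucket to `exon|gc_3`, then `exon`, then `*`.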

On-disk: checkpoint directory

geno-lewm-v0.1.0-carbon-500m-r1/
├── manifest.json
├── predictor.safetensors
├── action_encoder.safetensors
├── calibration.parquet
├── train_config.yaml
├── model_package.json
├── eval_metrics.json
├── eval_report.md
├── efficiency_report.json
├── model_card.md
├── SHA256SUMS
├── encoder_hash.txt
├── tokenizer/              # symlink or copy of Carbon's tokenizer
└── lora/                   # Phase 2+ only
    └── carbon_lora.safetensors

The manifest.json schema is normative and frozen at v0.1; see RFC-0011 §3.

All weight files use safetensors. Canonical serialization for hashing sorts the state dict by key (UTF-8 lexicographic) before encoding.

Dataset release candidates are prepared by python -m tools.release.dataset_snapshot --spec-json ... --dataset-dir ... --overwrite. The snapshot spec names the pinned upstream Carbon files plus local gnomAD and ClinVar VCF/VCF.gz release files. The command stages Carbon files, builds gnomAD and ClinVar Parquet shards, writes dataset_package.json, and runs the dataset-package generator so the manifest, data card, split-integrity report, and checksums are generated from the same local inputs. It also writes dataset_snapshot_report.json, which records the checked spec hash and upstream source file hashes using public-safe source references rather than private absolute local paths. dataset_snapshot_report.json is included in SHA256SUMS; tools.release.paper_package validates its schema, package paths, generated metadata/manifest/data-card/integrity hashes/sizes, staged file hashes/sizes/splits, and source references before accepting a release dataset.

The checked first-experiment training/eval configs live under configs/first_experiment/. geno-lewm-train --carbon-preflight validates the training config with the closed GenoLeWM config schema and records both the config hash and resolved payload in training_preflight_report.json before a Carbon-backed run can launch. The report uses public-relative path references plus hashes and sizes so it can be published with a training run without leaking local filesystem roots. Release training-run packages include that report in training_run_SHA256SUMS; the paper/demo verifier rejects missing, stale, non-ok, dataset-mismatched, config-mismatched, or private-path preflight evidence.

Checkpoint release candidates are prepared by python -m tools.release.eval_report, python -m bench.inference --release-efficiency, and then python -m tools.release.model_package. The eval-report helper renders eval_report.md from packaged measured metrics JSON (eval_metrics.json). The inference benchmark writes measured efficiency_report.json with single-variant latency, batched throughput, peak memory, command, hardware/runtime notes, samples, warm-up, limitations, and input identities. Metrics conclusions must explicitly reference every measured metric name before those conclusions can feed a paper/report draft. The model-package helper writes normalized model_package.json, renders model_card.md, and writes SHA256SUMS from manifest.json plus release metadata, with eval_metrics.json and efficiency_report.json included as packaged source files.

The model and paper/package verifiers cross-check eval/efficiency release id, dataset snapshot, commit, and model-result identity against the manifest-backed package. They also re-render the model card from model_package.json plus manifest.json and reject stale output before release-candidate reports pass.

python -m tools.release.paper_draft renders the paper/report draft from the same artifact set, including Artifact Availability entries for model_package.json, dataset_package.json, dataset_snapshot_report.json, eval_metrics.json, eval_config.effective.yaml, eval_report.md, efficiency_report.json, and the terminal demo evidence. The final paper/demo package verifier checks that normalized model-package metadata, the model card, generated eval report, packaged metrics JSON, efficiency report, manifest artifact hashes, checksum file, data card, dataset package metadata, dataset manifest, paper package, terminal demo transcript, runtime_preflight_report.json, and batch_receipt_report.json are present and internally consistent; it re-renders the model card and paper draft from the current artifacts and rejects stale Markdown.

python -m tools.release.release_candidate then emits release_candidate_report.json, binding that package result to the Hub dry-run plan, public artifact URL reachability checks, commit SHA, model id, dataset snapshot, source metrics JSON, generated eval report, efficiency report, model package metadata, dataset package metadata, dataset snapshot report, manifest-backed predictor/action/calibration/config artifact identities, training_preflight_report.json, training_run_SHA256SUMS, Hub model/dataset/demo upload inventories, provider-backed checks that the public model, dataset, and demo listings expose the expected files, the readiness checklist, and key artifact hashes before the artifacts are linked from a public release.

On-disk: receipt

--receipt PATH

Canonical JSON: keys sorted lexicographically, no whitespace, UTF-8. Schema is normative at version 1.0.0; see RFC-0011 §5. The v1 receipt commits one score output. Single-variant scoring writes this file when the caller passes --receipt PATH. VCF scoring writes a JSONL sidecar at --receipt PATH with one canonical v1 receipt per scored alternate.

Release demos first write runtime_preflight_report.json from tools.release.runtime_preflight; that report records the model/input artifact identities, required native runtime dependency probes, backend probes, and the fail-closed network guard for the terminal command. Release verification rejects preflight reports generated with fixture/test manifest allowance enabled.

Release demos also write batch_receipt_report.json from tools.release.batch_receipt_report; this aggregate report does not replace per-row receipts, but verifies the score/receipt row count, artifact hashes, model id, calibration hash, runtime identity, row ordering, and score-output equality for the batch. tools.release.paper_package also compares the batch report's model id and calibration hash with the packaged model manifest before a demo package is accepted.
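The canonical JSON encoding described for receipts (sorted keys, no whitespace, UTF-8) can be approximated with the standard library. This is a sketch only: the normative rules live in RFC-0011, the receipt fields in the usage note are placeholders, and `canonical_jsonl` is a hypothetical helper for the VCF sidecar case.

```python
import hashlib
import json
from typing import Any, Iterable


def canonical_json(obj: Any) -> bytes:
    """Serialize with sorted keys, no insignificant whitespace, UTF-8 output."""
    text = json.dumps(
        obj,
        sort_keys=True,
        separators=(",", ":"),  # no spaces after ',' or ':'
        ensure_ascii=False,     # emit non-ASCII as UTF-8 rather than \u escapes
        allow_nan=False,        # NaN/Infinity are not valid JSON
    )
    return text.encode("utf-8")


def canonical_json_sha256(obj: Any) -> str:
    """Hex SHA-256 of the canonical encoding."""
    return hashlib.sha256(canonical_json(obj)).hexdigest()


def canonical_jsonl(receipts: Iterable[Any]) -> bytes:
    """One canonical receipt per line, as in a JSONL sidecar."""
    return b"".join(canonical_json(r) + b"\n" for r in receipts)
```

Two dictionaries with the same content but different insertion order, such as `{"score": 1, "model_id": "m"}` and `{"model_id": "m", "score": 1}`, encode to identical bytes and therefore hash identically.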

Wire formats

VCF / VCF.gz consumed at the CLI boundary. cyvcf2 is the parser (pinned ≥ 0.30 for indexed-VCF iterators).
FASTA consumed for reference genome assemblies. An index (.fai) is required; if it is missing, the CLI builds it via pysam.
23andMe / AncestryDNA / MyHeritage raw data consumed by the desktop runtime; conversion is a local-only step that produces a VCF in a tmpdir. These array formats do not include VCF REF alleles; the converter requires a local reference-allele map keyed by (chrom, pos) and fails with VcfParseError when the reference allele is absent. The conversion is documented and tested per format.
Sequencing.com WGS JSON consumed where available; conversion supports VCF-equivalent variant rows with explicit reference and alternate alleles.

Schema versioning

Every on-disk artifact carries a top-level schema_version field.
Schema bumps follow semver; the contract is documented in 09-release-and-versioning.md.
Loaders accept any schema with the same MAJOR and ignore unknown optional fields; unknown required fields raise SchemaCompatError.
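A loader-side version check along these lines might look as follows. It is a sketch under assumptions: the function name is hypothetical, and raising `SchemaCompatError` for a missing, malformed, or different-MAJOR version is an assumption, since this section only names that error for unknown required fields.

```python
from typing import Any, Mapping, Tuple


class SchemaCompatError(ValueError):
    """Raised when an artifact's schema cannot be loaded."""


def check_schema_version(artifact: Mapping[str, Any],
                         supported_major: int = 1) -> Tuple[int, int, int]:
    """Accept any MAJOR.MINOR.PATCH whose MAJOR matches the loader's."""
    version = artifact.get("schema_version")
    if not isinstance(version, str):
        raise SchemaCompatError("missing top-level schema_version")
    try:
        major, minor, patch = (int(part) for part in version.split("."))
    except ValueError as exc:
        raise SchemaCompatError(f"malformed schema_version {version!r}") from exc
    if major != supported_major:
        raise SchemaCompatError(
            f"schema MAJOR {major} is not supported (loader MAJOR {supported_major})"
        )
    return major, minor, patch
```

Under this rule a loader built for 1.0.0 accepts 1.2.0 and skips optional fields it does not recognize, but refuses 2.0.0.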

Invariants

ID	Invariant	Enforced by
INV-DATA-1	EditSpec validates ACGT-only bases at construction	`EditSpec.__post_init__`
INV-DATA-2	Window content is uppercased before hashing for the cache	`encoder/windowing.py::canonicalize`
INV-DATA-3	Cache rows are immutable; no in-place updates	`encoder/cache.py::write_shard`
INV-DATA-4	Manifest hashes are computed over canonical JSON (sorted keys, no whitespace)	`provenance/hashing.py::canonical_json_sha256`
INV-DATA-5	Calibration buckets back off in a fixed order: (region, gc, repeat) → (region, gc) → (region) → (*)	`surprise/context.py::backoff_chain`
INV-DATA-5A	Calibration table files match the documented Parquet schema exactly	`surprise/calibration.py::read_calibration_table`
INV-DATA-6	gnomAD variants with `filter != "PASS"` are never used for calibration or training	`data/gnomad.py::filter_passing`
INV-DATA-7	ClinVar VUS rows are loaded but excluded from labelled eval	`data/clinvar.py::label_set`
INV-DATA-8	All datetimes on disk are UTC ISO-8601 with second resolution; durations are integer nanoseconds	linter rule
INV-DATA-9	Receipt JSON is canonical-JSON (sorted keys, no whitespace, UTF-8)	`provenance/receipt.py::write`
INV-DATA-10	Cache reads never write back; cache writes never overwrite	both `encoder/cache.py` paths

Open questions

ID	Question	Owner	Target
OQ-DATA-1	Whether to add a `phase` field to TrainingTuple for haplotype tuples vs single-edit tuples	core	v0.2
OQ-DATA-2	Whether calibration tables should also store per-bucket bootstrap std for confidence-aware downstream use	core	v0.2
OQ-DATA-3	Whether gnomAD/ClinVar Parquet shards should be split by chromosome for selective loading	core	when corpus exceeds 50 GB