03 — Data model
Every persistent type, schema, and on-disk format that crosses a process
boundary is specified here. Schema versions are part of every artifact;
changes follow the policy in 09-release-and-versioning.md.
In-memory types
EditSpec

    @dataclass(frozen=True, slots=True)
    class EditSpec:
        chrom: str           # non-empty, e.g., "chr17" or "17"
        pos: int             # 1-based, VCF convention; >= 1
        ref: str             # uppercase ACGT only, len in [1, V1_MAX_LEN]
        alt: str             # uppercase ACGT only, len in [1, V1_MAX_LEN]
        edit_type: EditType  # derived from (len(ref), len(alt))

Constants:
- V1_MAX_LEN = 16 (bp). Edits longer than this raise UnsupportedEditError with an edit_type == SV payload.

Invariants:
- ref != alt (else InvalidEditError).
- set(ref) ⊆ {A, C, G, T}, same for alt (else InvalidEditError).
- chrom is the contig name as it appears in the reference FASTA.
- pos is 1-based, matching VCF semantics.
RelEdit

    @dataclass(frozen=True, slots=True)
    class RelEdit:
        rel_pos: int         # 0-based offset within window, in bp
        edit_type: EditType
        ref_bases: str
        alt_bases: str

Invariants:
- 0 <= rel_pos < window_length_bp.
- window[rel_pos : rel_pos + len(ref_bases)] == ref_bases.upper() at apply time (else WindowMismatchError).
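A minimal sketch of applying a RelEdit to a window string, assuming the window is already uppercased. The function name apply_rel_edit is hypothetical, edit_type is omitted for brevity, and raising a plain ValueError for an out-of-range rel_pos is an assumption, since the spec only names WindowMismatchError for the reference check.

```python
from dataclasses import dataclass


class WindowMismatchError(Exception):
    pass


@dataclass(frozen=True)
class RelEdit:
    rel_pos: int    # 0-based offset within the window
    ref_bases: str
    alt_bases: str  # edit_type is left out of this sketch


def apply_rel_edit(window: str, edit: RelEdit) -> str:
    """Return the edited window, checking both RelEdit invariants first."""
    if not 0 <= edit.rel_pos < len(window):
        raise ValueError("rel_pos must satisfy 0 <= rel_pos < window length")
    end = edit.rel_pos + len(edit.ref_bases)
    if window[edit.rel_pos:end] != edit.ref_bases.upper():
        raise WindowMismatchError(
            f"window has {window[edit.rel_pos:end]!r} at {edit.rel_pos}, "
            f"expected {edit.ref_bases.upper()!r}"
        )
    # Splice: prefix + alternate bases + suffix after the reference span.
    return window[:edit.rel_pos] + edit.alt_bases + window[end:]


edited = apply_rel_edit("ACGTACGT", RelEdit(rel_pos=2, ref_bases="G", alt_bases="T"))
```

The result of this call is the kind of string stored in TrainingTuple.target_window.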
TrainingTuple

    @dataclass
    class TrainingTuple:
        window_id: str        # SHA-256 (hex) of the reference window
        rel_edit: RelEdit     # the action
        target_window: str    # the edited window string
        edit_source: Literal["gnomad", "synthetic_snv",
                             "synthetic_indel", "clinvar"]

Used internally by the data pipeline; not persisted as a tuple but
emitted in batches via IterableDataset.
SurpriseResult
See 02-public-api.md for the field list; consumers
treat it as immutable.
On-disk: window-embedding cache
- Format: Parquet shards.
- Path: ${GENO_LEWM_CACHE}/embeddings/{encoder_id}/{state_layer}/{pool_type}_{pool_radius}/chr{contig}_{stride_block}.parquet
- Compression: Zstandard, level 9.
- One row per cached window.
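The path template above can be expanded with a small helper. This is only a sketch: the function name shard_path, the argument types, and the example values (such as the encoder id) are hypothetical.

```python
from pathlib import PurePosixPath


def shard_path(cache_root: str, encoder_id: str, state_layer: int,
               pool_type: str, pool_radius: int,
               contig: str, stride_block: int) -> PurePosixPath:
    """Expand the shard path template for one (contig, stride block)."""
    return (
        PurePosixPath(cache_root)
        / "embeddings"
        / encoder_id
        / str(state_layer)
        / f"{pool_type}_{pool_radius}"
        / f"chr{contig}_{stride_block}.parquet"
    )


# Example values are made up for illustration.
example = shard_path("/cache", "example-encoder", 12, "centered_mean", 8, "17", 3)
```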
Schema (Parquet)
| column | type | nullable | description |
|---|---|---|---|
| chrom | string | no | chromosome / contig |
| start_bp | int64 | no | inclusive |
| end_bp | int64 | no | exclusive (end_bp - start_bp == window_bp) |
| window_hash | binary(32) | no | SHA-256 of the uppercased ACGT string |
| encoder_hash | binary(32) | no | SHA-256 of the encoder weights file |
| state_layer | int8 | no | layer index used |
| pool_type | string | no | one of centered_mean, global_mean, attention |
| pool_radius | int32 | no | pool radius in tokens |
| dtype | string | no | one of bf16, fp16, fp32 |
| embedding | list | no | the state vector; length == d_state |
| untargeted | bool | no | true iff no edit locus was specified |
| created_at | int64 | no | UTC unix nanoseconds |
| schema_version | string | no | always 1.0.0 for v0.1 |

The cache is content-addressed by
(window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype).
Changing any of these fields invalidates the cached entry; the cache
loader treats absence under a new key as a cache miss, not an error.
The cache writer never overwrites existing rows; a write that would duplicate a key is a no-op (post-hash equality check).
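The content-addressing and no-overwrite rules can be illustrated with an in-memory stand-in for the cache. The class InMemoryCache and its method names are hypothetical; the real cache is Parquet-backed, and this sketch only mirrors the key structure and the miss/no-op semantics described above.

```python
import hashlib
from typing import Optional, Sequence, Tuple

# (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype)
CacheKey = Tuple[bytes, bytes, int, str, int, str]


def window_hash(window: str) -> bytes:
    """SHA-256 of the uppercased window string (32 raw bytes)."""
    return hashlib.sha256(window.upper().encode("ascii")).digest()


class InMemoryCache:
    def __init__(self) -> None:
        self._rows = {}

    def get(self, key: CacheKey) -> Optional[tuple]:
        # An absent key is a cache miss (None), not an error.
        return self._rows.get(key)

    def put(self, key: CacheKey, embedding: Sequence[float]) -> bool:
        # Never overwrite: a duplicate key is a no-op.
        if key in self._rows:
            return False
        self._rows[key] = tuple(embedding)
        return True


cache = InMemoryCache()
key = (window_hash("acgtacgt"), b"\x00" * 32, 12, "centered_mean", 8, "bf16")
first_write = cache.put(key, [0.1, 0.2])
second_write = cache.put(key, [9.9, 9.9])  # ignored, the original row is kept
```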
Cache index (SQLite)
A companion SQLite database ${GENO_LEWM_CACHE}/embeddings/index.sqlite
maps window_hash (hex, 64 chars) → (Parquet shard path, row offset).

    CREATE TABLE window_index (
        window_hash   TEXT    NOT NULL,
        encoder_hash  TEXT    NOT NULL,
        state_layer   INTEGER NOT NULL,
        pool_type     TEXT    NOT NULL,
        pool_radius   INTEGER NOT NULL,
        dtype         TEXT    NOT NULL,
        shard_path    TEXT    NOT NULL,
        row_offset    INTEGER NOT NULL,
        created_at    INTEGER NOT NULL,
        PRIMARY KEY (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype)
    );
    CREATE INDEX idx_shard_path ON window_index(shard_path);

The SQLite file is rebuildable from the Parquet shards at any time via
geno-lewm-cache-windows --reindex.
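A minimal sketch of writing to and querying this index with Python's built-in sqlite3 module, using an in-memory database. The helper names index_row and lookup are hypothetical, and using INSERT OR IGNORE to mirror the cache's no-overwrite rule is an assumption rather than the documented behavior of the indexer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE window_index (
    window_hash TEXT NOT NULL, encoder_hash TEXT NOT NULL,
    state_layer INTEGER NOT NULL, pool_type TEXT NOT NULL,
    pool_radius INTEGER NOT NULL, dtype TEXT NOT NULL,
    shard_path TEXT NOT NULL, row_offset INTEGER NOT NULL,
    created_at INTEGER NOT NULL,
    PRIMARY KEY (window_hash, encoder_hash, state_layer, pool_type, pool_radius, dtype)
);
CREATE INDEX idx_shard_path ON window_index(shard_path);
"""


def index_row(conn: sqlite3.Connection, row: tuple) -> bool:
    """Insert one index row; return False if the key already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO window_index VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row
    )
    return cur.rowcount == 1


def lookup(conn: sqlite3.Connection, key: tuple):
    """Return (shard_path, row_offset) for a full cache key, or None."""
    return conn.execute(
        "SELECT shard_path, row_offset FROM window_index "
        "WHERE window_hash = ? AND encoder_hash = ? AND state_layer = ? "
        "AND pool_type = ? AND pool_radius = ? AND dtype = ?",
        key,
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
key = ("ab" * 32, "cd" * 32, 12, "centered_mean", 8, "bf16")  # made-up values
inserted = index_row(conn, key + ("chr17_3.parquet", 0, 1_700_000_000_000_000_000))
duplicate = index_row(conn, key + ("chr17_4.parquet", 5, 1_700_000_000_000_000_001))
hit = lookup(conn, key)
```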
On-disk: gnomAD shard
- Path: ${GENO_LEWM_DATA}/gnomad/{release}/variants.parquet
- Default release: v4.1.
- Builder: geno-lewm-prepare-gnomad --input-vcf <local.vcf[.gz]> --output ${GENO_LEWM_DATA}.

| column | type | description |
|---|---|---|
| chrom | string | contig |
| pos | int64 | 1-based |
| ref | string | uppercase ACGT |
| alt | string | uppercase ACGT |
| af_global | float32 | global allele frequency |
| af_afr | float32 | African |
| af_ami | float32 | Amish |
| af_amr | float32 | Admixed American |
| af_asj | float32 | Ashkenazi Jewish |
| af_eas | float32 | East Asian |
| af_fin | float32 | Finnish |
| af_nfe | float32 | Non-Finnish European |
| af_oth | float32 | Other |
| af_sas | float32 | South Asian |
| filter | string | gnomAD VCF FILTER field |
| schema_version | string | always 1.0.0 for v0.1 |

Only variants with af_global >= 0.01 and filter == 'PASS' are included.
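The inclusion rule can be written as a simple row predicate. This sketch works on plain dictionaries with made-up values; the names keep_variant and MIN_AF_GLOBAL are hypothetical and the real builder operates on VCF input.

```python
MIN_AF_GLOBAL = 0.01  # inclusion threshold stated above


def keep_variant(row: dict) -> bool:
    """True when a row passes both the FILTER and allele-frequency rules."""
    return row["filter"] == "PASS" and row["af_global"] >= MIN_AF_GLOBAL


# Illustrative rows only, not real gnomAD records.
rows = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "af_global": 0.25, "filter": "PASS"},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "T", "af_global": 0.001, "filter": "PASS"},
    {"chrom": "chr1", "pos": 300, "ref": "G", "alt": "A", "af_global": 0.30, "filter": "AC0"},
]
kept = [row for row in rows if keep_variant(row)]
```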
On-disk: ClinVar shard
- Path: ${GENO_LEWM_DATA}/clinvar/{release}/variants.parquet
- Release is a date string like 2026-04-15, matching NCBI's release.
- Builder: geno-lewm-prepare-clinvar --input-vcf <local.vcf[.gz]> --release {release} --output ${GENO_LEWM_DATA}.

| column | type | description |
|---|---|---|
| chrom | string | contig |
| pos | int64 | 1-based |
| ref | string | uppercase ACGT |
| alt | string | uppercase ACGT |
| clinical_significance | string | enum: P, LP, LB, B, VUS, OTHER |
| review_status | string | ClinVar review status |
| gene_symbol | string | nullable |
| clinvar_id | int64 | ClinVar variation ID |
| schema_version | string | always 1.0.0 for v0.1 |

VUS is included for completeness but excluded from eval label sets.
On-disk: dataset release package
The first paper/demo dataset snapshot is published as one directory:
geno-lewm-data-v0.1.0-r1/
├── data_card.md
├── dataset_package.json
├── dataset_manifest.json
├── split_integrity.json
├── SHA256SUMS
├── carbon/
│ └── ...
└── clinvar/
└── ...
python -m tools.release.dataset_package --dataset-dir DIR --metadata-json dataset_package.json
generates the normalized metadata file, card, manifest, split-integrity
report, and checksum file from already-built shards. The metadata JSON
records snapshot_id, upstream source
revisions, preprocessing commands, split policy, leakage checks,
intended use, limitations, and the relative paths of included files.
tools.release.paper_package rejects a dataset package when
data_card.md or dataset_manifest.json no longer matches
dataset_package.json.
The generated dataset_manifest.json is the machine-readable source of
truth for release packaging. It contains:
- schema_version, fixed at 1.0.0;
- snapshot_id, for example geno-lewm-data-v0.1.0-r1;
- sources, including upstream names and pinned revisions;
- splits, including per-split record counts;
- files, each with a relative path, sha256:<hex> digest, size_bytes, and optional split/description metadata.
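As an illustration of one files entry, the sketch below hashes a file in chunks and records its size. Only size_bytes and the sha256:<hex> digest form come from the description above; the key names path, sha256, and split, and the helper name file_entry, are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def file_entry(dataset_dir: Path, rel_path: str, split: Optional[str] = None) -> dict:
    """Build one manifest entry for a file under dataset_dir."""
    digest = hashlib.sha256()
    size = 0
    with (dataset_dir / rel_path).open("rb") as handle:
        # Read in 1 MiB chunks so large shards are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    entry = {
        "path": rel_path,  # relative to the package root
        "sha256": "sha256:" + digest.hexdigest(),
        "size_bytes": size,
    }
    if split is not None:
        entry["split"] = split
    return entry


# Demonstration against a throwaway directory with placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clinvar").mkdir()
    (root / "clinvar" / "variants.parquet").write_bytes(b"not a real shard")
    entry = file_entry(root, "clinvar/variants.parquet", split="eval")
```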
split_integrity.json is generated from the manifest. It recomputes
file hashes/sizes, observes row counts for JSONL, VCF, text, CSV/TSV,
and Parquet files, records observed label/class balance from
clinical_significance, label, or VCF CLNSIG fields, and extracts
comparable keys from JSONL locus_key, variant_key,
variant-coordinate, or record_id rows, from VCF alternate rows, and
from Parquet rows that expose chrom, pos, ref, and alt columns.
It fails when no train/eval comparable-key comparison can be made. It
also records generated_by: tools.release.dataset_integrity; the
paper/demo verifier rejects reports with a missing or mismatched source
header. The generated data_card.md renders the same split-level class
balance from split_integrity.json.
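The comparable-key comparison can be sketched as a set intersection over (chrom, pos, ref, alt) keys. The split names train and eval, the report fields other than generated_by, and the function names are assumptions; the actual report format is defined by tools.release.dataset_integrity.

```python
from typing import Dict, List, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_key(row: dict) -> VariantKey:
    """Comparable key for rows exposing chrom, pos, ref, and alt."""
    return (row["chrom"], row["pos"], row["ref"], row["alt"])


def check_split_leakage(splits: Dict[str, List[dict]]) -> dict:
    train = {variant_key(r) for r in splits.get("train", [])}
    evaluation = {variant_key(r) for r in splits.get("eval", [])}
    if not train or not evaluation:
        # Mirrors the rule that the check fails when nothing can be compared.
        raise ValueError("no train/eval comparable-key comparison can be made")
    overlap = sorted(train & evaluation)
    return {
        "generated_by": "tools.release.dataset_integrity",
        "train_keys": len(train),
        "eval_keys": len(evaluation),
        "overlapping_keys": overlap,
        "ok": not overlap,
    }


# Illustrative rows only.
report = check_split_leakage({
    "train": [{"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"}],
    "eval": [{"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
             {"chrom": "chr2", "pos": 50, "ref": "C", "alt": "T"}],
})
```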
On-disk: calibration table
- Path inside the checkpoint: calibration.parquet
- Built by: geno-lewm-cache-windows --build-calibration
- Schema (RFC-0009 §3.4):

| column | type | description |
|---|---|---|
| bucket_id | string | {region_class}\|{gc_bin}\|{repeat_class} |
| n_calibration | int64 | number of gnomAD variants in this bucket |
| cdf | list | 1001 points: F(σ_raw) at σ-grid quantiles |
| sigma_grid | list | the σ_raw grid the CDF is evaluated on |
| back_off_to | string | parent bucket id if this bucket is sparse; nullable |
| schema_version | string | always 1.0.0 for v0.1 |

Bucket IDs are ASCII pipe-joined labels. Full buckets use
{region_class}|{gc_bin}|{repeat_class}. Parent buckets omit the
rightmost factors ({region_class}|{gc_bin}, then {region_class}),
and the final catch-all bucket is *.
The builder consumes pre-scored reference rows (bucket_id, sigma_raw)
and writes full, parent, and catch-all bucket CDFs. confidence and
low_confidence are derived at scoring time from the selected bucket's
n_calibration; they are not stored as separate Parquet columns.
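A sketch of the bucket back-off order and a CDF lookup, using a tiny made-up table in place of the 1001-point grids. Treating a non-null back_off_to as "sparse, skip this bucket", and interpolating linearly between grid points, are both assumptions for illustration; the bucket labels are invented.

```python
from bisect import bisect_right
from typing import Dict, List


def backoff_chain(bucket_id: str) -> List[str]:
    """Full bucket, then parents with rightmost factors dropped, then '*'."""
    if bucket_id == "*":
        return ["*"]
    parts = bucket_id.split("|")
    return ["|".join(parts[:n]) for n in range(len(parts), 0, -1)] + ["*"]


def select_bucket(table: Dict[str, dict], bucket_id: str) -> dict:
    """Pick the first bucket in the chain that exists and is not sparse."""
    for candidate in backoff_chain(bucket_id):
        row = table.get(candidate)
        if row is not None and row["back_off_to"] is None:
            return row
    raise KeyError(f"no usable calibration bucket for {bucket_id!r}")


def calibrated_cdf(row: dict, sigma_raw: float) -> float:
    """Evaluate F(sigma_raw), clamping outside the grid."""
    grid, cdf = row["sigma_grid"], row["cdf"]
    i = bisect_right(grid, sigma_raw)
    if i == 0:
        return cdf[0]
    if i == len(grid):
        return cdf[-1]
    x0, x1, y0, y1 = grid[i - 1], grid[i], cdf[i - 1], cdf[i]
    return y0 + (y1 - y0) * (sigma_raw - x0) / (x1 - x0)


table = {
    "exon|gc3|none": {"bucket_id": "exon|gc3|none", "n_calibration": 12,
                      "sigma_grid": [], "cdf": [], "back_off_to": "exon|gc3"},
    "exon|gc3": {"bucket_id": "exon|gc3", "n_calibration": 5000,
                 "sigma_grid": [0.0, 1.0, 2.0], "cdf": [0.0, 0.5, 1.0],
                 "back_off_to": None},
}
chosen = select_bucket(table, "exon|gc3|none")
```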
On-disk: checkpoint directory
geno-lewm-v0.1.0-carbon-500m-r1/
├── manifest.json
├── predictor.safetensors
├── action_encoder.safetensors
├── calibration.parquet
├── train_config.yaml
├── model_package.json
├── eval_metrics.json
├── eval_report.md
├── efficiency_report.json
├── model_card.md
├── SHA256SUMS
├── encoder_hash.txt
├── tokenizer/ # symlink or copy of Carbon's tokenizer
└── lora/ # Phase 2+ only
└── carbon_lora.safetensors
The manifest.json schema is normative and frozen at v0.1; see
RFC-0011 §3.
All weight files use safetensors. Canonical serialization for hashing
sorts the state dict by key (UTF-8 lexicographic) before encoding.
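The key-ordering rule can be illustrated with a toy state dict of raw bytes. Only the UTF-8 lexicographic sort comes from the text above; the length-prefixed framing and the function name canonical_state_hash are assumptions, and the real serialization is safetensors.

```python
import hashlib
import struct
from typing import Dict


def canonical_state_hash(state: Dict[str, bytes]) -> str:
    """Hash tensors in UTF-8 lexicographic key order (framing is assumed)."""
    digest = hashlib.sha256()
    for key in sorted(state, key=lambda name: name.encode("utf-8")):
        key_bytes = key.encode("utf-8")
        value = state[key]
        # Length prefixes keep (key, value) boundaries unambiguous.
        digest.update(struct.pack("<Q", len(key_bytes)))
        digest.update(key_bytes)
        digest.update(struct.pack("<Q", len(value)))
        digest.update(value)
    return digest.hexdigest()


# Insertion order differs, but the hash does not.
hash_a = canonical_state_hash({"layer.1.weight": b"\x01\x02", "layer.0.weight": b"\x03"})
hash_b = canonical_state_hash({"layer.0.weight": b"\x03", "layer.1.weight": b"\x01\x02"})
```

Sorting before encoding makes the digest independent of the order in which a framework happens to enumerate parameters.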
Dataset release candidates are prepared by
python -m tools.release.dataset_snapshot --spec-json ... --dataset-dir ... --overwrite.
The snapshot spec names the pinned upstream Carbon files plus local
gnomAD and ClinVar VCF/VCF.gz release files. The command stages Carbon
files, builds gnomAD and ClinVar Parquet shards, writes
dataset_package.json, and runs the dataset-package generator so the
manifest, data card, split-integrity report, and checksums are generated
from the same local inputs. It also writes
dataset_snapshot_report.json, which records the checked spec hash and
upstream source file hashes using public-safe source references rather
than private absolute local paths. dataset_snapshot_report.json is
included in SHA256SUMS; tools.release.paper_package validates its
schema, package paths, generated metadata/manifest/data-card/integrity
hashes/sizes, staged file hashes/sizes/splits, and source references
before accepting a release dataset.
The checked first-experiment training/eval configs live under
configs/first_experiment/. geno-lewm-train --carbon-preflight
validates the training config with the closed GenoLeWM config schema and
records both the config hash and resolved payload in
training_preflight_report.json before a Carbon-backed run can launch.
The report uses public-relative path references plus hashes and sizes so
it can be published with a training run without leaking local filesystem
roots. Release training-run packages include that report in
training_run_SHA256SUMS; the paper/demo verifier rejects missing,
stale, non-ok, dataset-mismatched, config-mismatched, or private-path
preflight evidence.
Checkpoint release candidates are prepared by running, in order,
python -m tools.release.eval_report,
python -m bench.inference --release-efficiency, and
python -m tools.release.model_package:
- The eval-report helper renders eval_report.md from packaged measured
  metrics JSON (eval_metrics.json).
- The inference benchmark writes measured efficiency_report.json with
  single-variant latency, batched throughput, peak memory, command,
  hardware/runtime notes, samples, warm-up, limitations, and input
  identities.
- The model-package helper writes normalized model_package.json,
  renders model_card.md, and writes SHA256SUMS from manifest.json plus
  release metadata, with eval_metrics.json and efficiency_report.json
  included as packaged source files.

Metrics conclusions must explicitly reference every measured metric name
before those conclusions can feed a paper/report draft. The model and
paper/package verifiers cross-check eval/efficiency release id, dataset
snapshot, commit, and model-result identity against the manifest-backed
package. They also re-render the model card from model_package.json plus
manifest.json and reject stale output before release-candidate reports
pass.
python -m tools.release.paper_draft renders the paper/report draft
from the same artifact set, including Artifact Availability entries for
model_package.json, dataset_package.json,
dataset_snapshot_report.json, eval_metrics.json,
eval_config.effective.yaml, eval_report.md,
efficiency_report.json, and the terminal demo evidence. The final
paper/demo package verifier checks that normalized
model-package metadata, the model card, generated eval report, packaged
metrics JSON, efficiency report, manifest artifact hashes, checksum file, data card,
dataset package metadata, dataset manifest, paper package, terminal demo
transcript,
runtime_preflight_report.json, and batch_receipt_report.json are
present and internally consistent; it re-renders the model card and
paper draft from the current artifacts and rejects stale Markdown.
python -m tools.release.release_candidate then emits
release_candidate_report.json before the artifacts are linked from a
public release. The report binds that package result to:
- the Hub dry-run plan;
- public artifact URL reachability checks;
- commit SHA, model id, and dataset snapshot;
- source metrics JSON, generated eval report, and efficiency report;
- model package metadata, dataset package metadata, and the dataset
  snapshot report;
- manifest-backed predictor/action/calibration/config artifact
  identities;
- training_preflight_report.json and training_run_SHA256SUMS;
- Hub model/dataset/demo upload inventories;
- provider-backed checks that the public model, dataset, and demo
  listings expose the expected files;
- the readiness checklist;
- key artifact hashes.
On-disk: receipt
Canonical JSON: keys sorted lexicographically, no whitespace, UTF-8.
Schema is normative at version 1.0.0; see
RFC-0011 §5.
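The canonical JSON rule (sorted keys, no whitespace, UTF-8) maps directly onto the standard json module. This is a sketch: the function names are hypothetical, the example payload fields are invented rather than taken from RFC-0011, and emitting non-ASCII characters as raw UTF-8 instead of \u escapes is an assumption.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_json_sha256(obj) -> str:
    return hashlib.sha256(canonical_json(obj)).hexdigest()


# Hypothetical payload, not the normative receipt schema.
payload = {"b": 1, "a": {"d": 2, "c": "σ"}}
encoded = canonical_json(payload)
```

Because key order and whitespace are fixed, two writers that agree on the payload produce byte-identical files and therefore identical hashes.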
The v1 receipt commits one score output. Single-variant scoring writes
this file when the caller passes --receipt PATH. VCF scoring writes a
JSONL sidecar at --receipt PATH with one canonical v1 receipt per
scored alternate. Release demos first write
runtime_preflight_report.json from tools.release.runtime_preflight;
that report records the model/input artifact identities, required native
runtime dependency probes, backend probes, and the fail-closed network
guard for the terminal command. Release verification rejects preflight
reports generated with fixture/test manifest allowance enabled. Release
demos also write
batch_receipt_report.json from tools.release.batch_receipt_report;
this aggregate report does not replace per-row receipts, but verifies
the score/receipt row count, artifact hashes, model id, calibration
hash, runtime identity, row ordering, and score-output equality for the
batch. tools.release.paper_package also compares the batch report's
model id and calibration hash with the packaged model manifest before a
demo package is accepted.
Wire formats
- VCF / VCF.gz consumed at the CLI boundary. cyvcf2 is the parser (pinned ≥ 0.30 for indexed-VCF iterators).
- FASTA consumed for reference genome assemblies. We require the index (.fai) to be present; if missing, the CLI builds it via pysam.
- 23andMe / AncestryDNA / MyHeritage raw data consumed by the desktop runtime; conversion is a local-only step that produces a VCF in a tmpdir. These array formats do not include VCF REF alleles; the converter requires a local reference-allele map keyed by (chrom, pos) and fails with VcfParseError when the reference allele is absent. The conversion is documented and tested per format.
- Sequencing.com WGS JSON consumed where available; conversion supports VCF-equivalent variant rows with explicit reference and alternate alleles.
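The reference-allele requirement for array formats can be sketched as below. The function name array_row_to_variants, the two-character genotype string, and the handling of no-calls are assumptions for illustration; only the (chrom, pos) keyed map and the VcfParseError failure come from the list above.

```python
from typing import Dict, List, Tuple


class VcfParseError(ValueError):
    pass


def array_row_to_variants(
    chrom: str, pos: int, genotype: str,
    ref_map: Dict[Tuple[str, int], str],
) -> List[Tuple[str, int, str, str]]:
    """Turn one array genotype call into (chrom, pos, ref, alt) rows."""
    ref = ref_map.get((chrom, pos))
    if ref is None:
        raise VcfParseError(f"no reference allele for {chrom}:{pos}")
    # Keep called bases that differ from the reference; skip no-calls.
    alts = sorted({base for base in genotype.upper() if base in "ACGT" and base != ref})
    return [(chrom, pos, ref, alt) for alt in alts]


ref_map = {("17", 100): "A"}  # made-up reference-allele map
het = array_row_to_variants("17", 100, "AG", ref_map)
```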
Schema versioning
- Every on-disk artifact carries a top-level schema_version field.
- Schema bumps follow semver; the contract is documented in 09-release-and-versioning.md.
- Loaders accept any schema with the same MAJOR and ignore unknown optional fields; unknown required fields raise SchemaCompatError.
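A minimal sketch of the loader-side MAJOR check. The helper names and the known-fields argument are hypothetical, and the sketch does not model how an artifact marks an unknown field as required, since that mechanism is not described here.

```python
from typing import Set

SUPPORTED_MAJOR = 1


class SchemaCompatError(Exception):
    pass


def check_schema_version(version: str, supported_major: int = SUPPORTED_MAJOR) -> None:
    """Accept any MINOR/PATCH under the supported MAJOR."""
    try:
        major = int(version.split(".")[0])
    except ValueError as exc:
        raise SchemaCompatError(f"unparseable schema_version {version!r}") from exc
    if major != supported_major:
        raise SchemaCompatError(
            f"schema_version {version} is not compatible with MAJOR {supported_major}"
        )


def load_record(raw: dict, known_fields: Set[str]) -> dict:
    """Validate the version, then drop unknown (optional) fields."""
    check_schema_version(raw["schema_version"])
    return {name: value for name, value in raw.items() if name in known_fields}


record = load_record(
    {"schema_version": "1.2.0", "chrom": "chr1", "added_in_1_2": True},
    known_fields={"schema_version", "chrom"},
)
```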
Invariants
| ID | Invariant | Enforced by |
|---|---|---|
| INV-DATA-1 | EditSpec validates ACGT-only bases at construction | EditSpec.__post_init__ |
| INV-DATA-2 | Window content is uppercased before hashing for the cache | encoder/windowing.py::canonicalize |
| INV-DATA-3 | Cache rows are immutable; no in-place updates | encoder/cache.py::write_shard |
| INV-DATA-4 | Manifest hashes are computed over canonical JSON (sorted keys, no whitespace) | provenance/hashing.py::canonical_json_sha256 |
| INV-DATA-5 | Calibration buckets back off in a fixed order: (region, gc, repeat) → (region, gc) → (region) → (*) | surprise/context.py::backoff_chain |
| INV-DATA-5A | Calibration table files match the documented Parquet schema exactly | surprise/calibration.py::read_calibration_table |
| INV-DATA-6 | gnomAD variants with filter != "PASS" are never used for calibration or training | data/gnomad.py::filter_passing |
| INV-DATA-7 | ClinVar VUS rows are loaded but excluded from labelled eval | data/clinvar.py::label_set |
| INV-DATA-8 | All datetimes on disk are UTC ISO-8601 with second resolution; durations are integer nanoseconds | linter rule |
| INV-DATA-9 | Receipt JSON is canonical-JSON (sorted keys, no whitespace, UTF-8) | provenance/receipt.py::write |
| INV-DATA-10 | Cache reads never write back; cache writes never overwrite | both encoder/cache.py paths |
Open questions
| ID | Question | Owner | Target |
|---|---|---|---|
| OQ-DATA-1 | Whether to add a phase field to TrainingTuple for haplotype tuples vs single-edit tuples | core | v0.2 |
| OQ-DATA-2 | Whether calibration tables should also store per-bucket bootstrap std for confidence-aware downstream use | core | v0.2 |
| OQ-DATA-3 | Whether gnomAD/ClinVar Parquet shards should be split by chromosome for selective loading | core | when corpus exceeds 50 GB |