Design decisions log¶
A record of resolved design trade-offs across GenoLeWM. The RFCs contain the rationale for each decision; this document is the index — a single place to look up "what did we decide about X and where is the justification?"
When an RFC ships a new resolved decision, add an entry here. When a decision is amended, append a new entry (do not edit the old one); the history is part of the value.
Architecture¶
State encoder is Carbon-500M, frozen in Phase 1¶
- Decided: 2026-05-20
- RFC: 0002 §3.1, §4.2
- Rationale (short): Carbon-500M is the smallest model that meets the published quality bar, fits the consumer-hardware deployment target, and unlocks single-GPU training when frozen.
State vectors are L2-normalized at the encoder¶
- Decided: 2026-05-20
- RFC: 0002 §3.5, §4.5
- Rationale (short): Stable cosine + MSE loss combination; stable distance-based surprise calculation.
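A minimal sketch of why normalization helps here, assuming plain Python lists as stand-ins for state vectors (the function names are illustrative, not the project's API). For unit vectors, squared Euclidean distance and cosine similarity are tied by `||a - b||² = 2 · (1 − cos)`, so the cosine and MSE terms never pull in opposite directions.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]

a = l2_normalize([1.0, 2.0, 2.0])
b = l2_normalize([2.0, 0.0, 1.0])
# Both expressions agree (up to float rounding) because a and b are unit length.
print(squared_distance(a, b), 2 * (1 - cosine(a, b)))
```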
Window length is 12,288 bp (2,048 6-mer tokens)¶
- Decided: 2026-05-20
- RFC: 0002 §3.2, §4.4
- Rationale (short): Middle ground between exon coverage and encoding cost.
Pooling is centered-mean over ± 256 tokens¶
- Decided: 2026-05-20
- RFC: 0002 §3.4, §4.1
- Rationale (short): Edit-local; no extra parameters; outperforms global mean on edit-sensitive tasks (to be confirmed by ablation).
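The pooling rule can be sketched as follows. This is an illustration under stated assumptions: `center` is taken to be the index of the token containing the edit, the span includes the center token, and the span is clipped at the window boundaries. None of these details are confirmed here; RFC 0002 §3.4 is the authority.

```python
def centered_mean_pool(token_embeddings, center, radius=256):
    """Average token embeddings in [center - radius, center + radius].

    token_embeddings: list of equal-length float lists, one per token.
    center: index of the token containing the edit (assumed convention).
    The span is clipped at the window edges (assumed behavior).
    """
    lo = max(0, center - radius)
    hi = min(len(token_embeddings), center + radius + 1)
    span = token_embeddings[lo:hi]
    dim = len(span[0])
    return [sum(tok[d] for tok in span) / len(span) for d in range(dim)]

# Toy example: 10 one-dimensional "embeddings" 0.0 .. 9.0, radius 2.
tokens = [[float(i)] for i in range(10)]
print(centered_mean_pool(tokens, center=5, radius=2))  # [5.0]
print(centered_mean_pool(tokens, center=0, radius=2))  # [1.0]
```

No parameters are learned in this step, which matches the "no extra parameters" point in the rationale.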
Action encoder uses four sub-encoders (position, type, ref, alt)¶
- Decided: 2026-05-20
- RFC: 0003 §3.4, §4.1
- Rationale (short): Inductive bias matching structure of an edit; shared ref/alt SeqMicroEncoder enforces compositional generalization.
v1 caps len(ref) and len(alt) at 16 bp¶
- Decided: 2026-05-20
- RFC: 0003 §3.1, §3.5, §4.3
- Rationale (short): Covers > 95% of clinically relevant short variants; SVs require separate adapter (v2 RFC).
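A hedged sketch of how the v1 length cap might be enforced at the input boundary. The function and constant names are hypothetical; only the 16 bp limit comes from the entry above.

```python
MAX_ALLELE_LEN = 16  # v1 cap on len(ref) and len(alt), in bp

def validate_edit(ref: str, alt: str) -> None:
    """Raise ValueError if either allele exceeds the v1 length cap."""
    for name, allele in (("ref", ref), ("alt", alt)):
        if len(allele) > MAX_ALLELE_LEN:
            raise ValueError(
                f"{name} is {len(allele)} bp; v1 supports at most "
                f"{MAX_ALLELE_LEN} bp (longer variants need the SV adapter)"
            )

validate_edit("A", "T")          # SNV: accepted
validate_edit("A", "ACGTACGT")   # short insertion: accepted
try:
    validate_edit("A", "A" * 17)  # one base over the cap
except ValueError as err:
    print(err)
```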
Predictor is cross-attention Transformer (4 cross + 2 self blocks)¶
- Decided: 2026-05-20
- RFC: 0004 §3.1, §4.1
- Rationale (short): Variable-length action sequences without arch change; cross-attention exposes structured action sub-embeddings to state.
Predictor output MLP final layer is zero-initialized¶
- Decided: 2026-05-20
- RFC: 0004 §3.4, §4.3
- Rationale (short): Identity-at-init; predictor starts by outputting s_t, making early training stable.
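A toy illustration of identity-at-init, assuming the predictor adds the output layer's result to `s_t` as a residual (that residual form is an assumption inferred from "starts by outputting s_t"; see RFC 0004 §3.4 for the actual design). All names are illustrative.

```python
import random

def make_output_layer(d_hidden, d_state, zero_init=True, seed=0):
    """Return (weight, bias) for a linear layer mapping d_hidden -> d_state."""
    rng = random.Random(seed)
    if zero_init:
        weight = [[0.0] * d_hidden for _ in range(d_state)]
    else:
        weight = [[rng.gauss(0.0, 0.02) for _ in range(d_hidden)]
                  for _ in range(d_state)]
    bias = [0.0] * d_state
    return weight, bias

def predict_next_state(s_t, hidden, weight, bias):
    """Residual prediction: s_t plus the output layer applied to hidden."""
    delta = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(weight, bias)]
    return [s + d for s, d in zip(s_t, delta)]

s_t = [0.6, 0.8]
hidden = [1.5, -2.0, 0.25]  # arbitrary activations from earlier layers
weight, bias = make_output_layer(d_hidden=3, d_state=2)
print(predict_next_state(s_t, hidden, weight, bias))  # [0.6, 0.8]
```

With a zero-initialized final layer, the first forward pass returns `s_t` unchanged regardless of what the earlier layers produce, so the initial loss reflects only the true state change.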
Training¶
Loss is α · (1 − cos) + β · MSE / d_state¶
- Decided: 2026-05-20
- RFC: 0005 §3.1, §4.1
- Rationale (short): Cosine for direction, MSE for magnitude calibration; matches LeWM recipe ported to L2-normalized embeddings.
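The loss in the heading can be written out as a short sketch. Two assumptions are made here: "MSE / d_state" is read as the summed squared error divided by the state dimension, and the default values of `alpha` and `beta` are placeholders, not the project's settings.

```python
import math

def state_loss(pred, target, alpha=1.0, beta=1.0):
    """alpha * (1 - cos(pred, target)) + beta * sum_sq_err / d_state."""
    d_state = len(pred)
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos = dot / (norm_p * norm_t)
    sum_sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    return alpha * (1.0 - cos) + beta * sum_sq_err / d_state

# Orthogonal unit vectors: cos = 0, squared error = 2, d_state = 2.
print(state_loss([1.0, 0.0], [0.0, 1.0]))  # 2.0
```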
LeJEPA regularizer is monitored-only in Phase 1¶
- Decided: 2026-05-20
- RFC: 0005 §3.2, §3.3, §4.4
- Rationale (short): Frozen encoder → collapse impossible → regularizer not needed as training term; computed for monitoring to catch unexpected drift.
Optimizer is AdamW with β₂ = 0.95¶
- Decided: 2026-05-20
- RFC: 0005 §3.4
- Rationale (short): Stability with small batches over high-dimensional latents; standard for JEPA training.
LR schedule is WSD (warmup-stable-decay)¶
- Decided: 2026-05-20
- RFC: 0005 §3.5, §4.3
- Rationale (short): Phase-transition friendly; checkpoint at the end of stable phase, continue training with fresh decay schedule when LoRA is enabled.
Batch size 256, edit-balanced sampling¶
- Decided: 2026-05-20
- RFC: 0005 §3.7, §4.5, §4.6
- Rationale (short): Matches LeWM; supports stable covariance estimation in Phase 2; per-type balance gives indels enough training signal.
Data¶
Reference corpus is HuggingFaceBio/carbon-pretraining-corpus¶
- Decided: 2026-05-20
- RFC: 0006 §3.1, §4.1
- Rationale (short): In-distribution for Carbon → most reliable encoder outputs; pre-processed and tokenization-validated; public.
Edit-source mix is 40 gnomAD / 30 synthetic SNV / 20 synthetic indel / 10 ClinVar¶
- Decided: 2026-05-20
- RFC: 0006 §3.3, §4.2
- Rationale (short): Balance of realism (gnomAD), action coverage (synthetic), and hard signal (ClinVar).
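To make the percentages concrete, the sketch below converts the mix into whole example counts using largest-remainder rounding. The key names, the rounding rule, and the example total of 256 are all illustrative assumptions; the log does not say how the mix is realized.

```python
EDIT_SOURCE_MIX = {  # percentages from the heading above
    "gnomad": 40,
    "synthetic_snv": 30,
    "synthetic_indel": 20,
    "clinvar": 10,
}

def allocate(total, mix):
    """Split `total` examples across sources in proportion to `mix`."""
    weight_sum = sum(mix.values())
    exact = {k: total * w / weight_sum for k, w in mix.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining slots to the sources with the largest fractional part.
    by_remainder = sorted(mix, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

print(allocate(256, EDIT_SOURCE_MIX))
# {'gnomad': 102, 'synthetic_snv': 77, 'synthetic_indel': 51, 'clinvar': 26}
```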
Windows overlap at 67% (stride 4,096 bp)¶
- Decided: 2026-05-20
- RFC: 0006 §3.2, §4.3
- Rationale (short): Each position covered by ~3 windows; gives predictor multiple contexts per variant.
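The relationship between stride, overlap, and per-position coverage for a 12,288 bp window is simple arithmetic, sketched below with two strides side by side. The helper names are illustrative, and edge effects at chromosome ends are ignored.

```python
WINDOW_BP = 12_288  # window length from the Architecture section

def overlap_fraction(stride_bp, window_bp=WINDOW_BP):
    """Fraction of a window shared with the next window."""
    return (window_bp - stride_bp) / window_bp

def windows_per_position(stride_bp, window_bp=WINDOW_BP):
    """Average number of windows covering a position (away from the ends)."""
    return window_bp / stride_bp

for stride in (4_096, 8_192):
    print(stride, round(overlap_fraction(stride), 2), windows_per_position(stride))
# 4096 0.67 3.0
# 8192 0.33 1.5
```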
Three holdouts: holdout-chr (chr21), holdout-clinvar, holdout-haplotypes¶
- Decided: 2026-05-20
- RFC: 0006 §3.8, §4.5
- Rationale (short): Clean spatial generalization (entire chromosome); clean known-pathogenic generalization (ClinVar P/LP); clean multi-edit generalization (gnomAD haplotypes).
Evaluation¶
VEP benchmarks mirror Carbon's published suite¶
- Decided: 2026-05-20
- RFC: 0007 §3.1, §4.1
- Rationale (short): Direct comparability with Carbon's model card numbers.
Two scoring heads reported: surprise and displacement¶
- Decided: 2026-05-20
- RFC: 0007 §3.1.2, §4.2
- Rationale (short): Different uses → different signals; prevents optimizing one at the cost of the other.
Rollout fidelity reported per-K with a naive baseline¶
- Decided: 2026-05-20
- RFC: 0007 §3.2, §4.3
- Rationale (short): Catches degenerate predictors that output s_t regardless of action.
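A sketch of a per-K report with the naive baseline alongside the model. Cosine distance is used as the error metric purely for illustration; the actual fidelity metric is defined in RFC 0007 §3.2, and the function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rollout_report(initial_state, predicted_states, true_states):
    """Per-K error for the model and for the naive 'state never changes' baseline."""
    report = []
    for k, (pred, true) in enumerate(zip(predicted_states, true_states), start=1):
        report.append({
            "K": k,
            "model": cosine_distance(pred, true),
            "naive": cosine_distance(initial_state, true),  # predicts s_t at every K
        })
    return report

def beats_naive(report):
    """True only if the model is strictly better than the baseline at every K."""
    return all(row["model"] < row["naive"] for row in report)

s_t = [1.0, 0.0]
truth = [[0.8, 0.6], [0.0, 1.0]]
degenerate = rollout_report(s_t, [s_t, s_t], truth)  # ignores the actions
print(beats_naive(degenerate))  # False
```

A predictor that simply echoes `s_t` ties the baseline at every K, so reporting both columns makes that failure mode visible even when absolute errors look small.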
Efficiency benchmarks include Apple M3 Max¶
- Decided: 2026-05-20
- RFC: 0007 §3.3, §4.4
- Rationale (short): Freedom-tech / personal-genome target audience skews Mac; first-class target for Phase 3 honesty.
Planning¶
Default solver is CEM¶
- Decided: 2026-05-20
- RFC: 0008 §3.4, §4.1
- Rationale (short): Discrete edit space; no per-task training; fast enough on H100 to amortize per query.
Planning never calls Carbon during search¶
- Decided: 2026-05-20
- RFC: 0008 §2
- Rationale (short): Efficiency thesis of the world-model framing; pay for Carbon once, run thousands of CEM rollouts at predictor cost.
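The two planning decisions above can be sketched together: a cross-entropy method over a discrete candidate set, where the scoring function only touches cheap predictor rollouts. Everything here is illustrative. `score_fn` stands in for "roll the predictor forward from an already-encoded state and score the result", and the hyperparameter names and defaults are placeholders, not the values from RFC 0008.

```python
import random

def cem_plan(candidates, horizon, score_fn, iters=10, pop=64,
             elite_frac=0.125, update_rate=0.7, seed=0):
    """Search for a high-scoring sequence of `horizon` edits with discrete CEM."""
    rng = random.Random(seed)
    n = len(candidates)
    # One categorical distribution over candidates per plan step, uniform at start.
    probs = [[1.0 / n] * n for _ in range(horizon)]
    n_elite = max(1, int(pop * elite_frac))
    best_plan, best_score = None, float("-inf")
    for _iteration in range(iters):
        batch = []
        for _sample in range(pop):
            idx = [rng.choices(range(n), weights=probs[t])[0] for t in range(horizon)]
            batch.append((score_fn([candidates[i] for i in idx]), idx))
        batch.sort(key=lambda pair: pair[0], reverse=True)
        elites = batch[:n_elite]
        if elites[0][0] > best_score:
            best_score = elites[0][0]
            best_plan = [candidates[i] for i in elites[0][1]]
        # Move each step's distribution toward the elite frequencies.
        for t in range(horizon):
            counts = [0] * n
            for _score, idx in elites:
                counts[idx[t]] += 1
            probs[t] = [update_rate * c / n_elite + (1 - update_rate) * p
                        for c, p in zip(counts, probs[t])]
    return best_plan, best_score

# Toy objective in place of predictor rollouts: match a hidden target sequence.
TARGET = ["A", "C", "G", "T"]

def toy_score(plan):
    return sum(a == b for a, b in zip(plan, TARGET))

plan, score = cem_plan(["A", "C", "G", "T"], horizon=4, score_fn=toy_score)
print(plan, score)
```

In this framing the expensive encoder would run once before `cem_plan` is called; every one of the `iters × pop` evaluations inside the loop costs only a predictor rollout.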
Surprise¶
Calibrated surprise is the published score; raw residual is also exposed¶
- Decided: 2026-05-20
- RFC: 0009 §3.5, §3.7, §4.4
- Rationale (short): Context-aware percentile is interpretable; raw exposed for debugging and recalibration.
Calibration distribution is gnomAD common variants (AF ≥ 1%)¶
- Decided: 2026-05-20
- RFC: 0009 §3.4, §4.2
- Rationale (short): Biology's tolerated background; appropriate null model.
Calibration buckets by (region_class, gc_bin, repeat_class) with back-off¶
- Decided: 2026-05-20
- RFC: 0009 §3.3, §4.3
- Rationale (short): Standard pattern (matches CADD); back-off handles sparse buckets gracefully.
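A sketch of bucketed percentile calibration with back-off. The back-off order (drop `repeat_class`, then `gc_bin`, then fall back to a global bucket), the `min_count` threshold, and all names are assumptions for illustration; RFC 0009 §3.3 and §3.4 define the real scheme. The null scores would come from the gnomAD common-variant set.

```python
from bisect import bisect_right

def backoff_keys(region_class, gc_bin, repeat_class):
    """Bucket keys from most to least specific (assumed back-off order)."""
    return [
        (region_class, gc_bin, repeat_class),
        (region_class, gc_bin, None),
        (region_class, None, None),
        (None, None, None),
    ]

def build_table(null_records):
    """null_records: iterable of (region_class, gc_bin, repeat_class, raw_residual)."""
    table = {}
    for region_class, gc_bin, repeat_class, raw in null_records:
        for key in backoff_keys(region_class, gc_bin, repeat_class):
            table.setdefault(key, []).append(raw)
    for scores in table.values():
        scores.sort()
    return table

def calibrated_surprise(raw, bucket, table, min_count=100):
    """Fraction of null residuals <= raw in the most specific adequately filled bucket."""
    for key in backoff_keys(*bucket):
        scores = table.get(key, [])
        if len(scores) >= min_count:
            return bisect_right(scores, raw) / len(scores)
    raise LookupError("no calibration bucket has enough null variants")

null = [
    ("exon", "low", "none", 0.1),
    ("exon", "low", "none", 0.3),
    ("exon", "high", "none", 0.2),
    ("intron", "low", "LINE", 0.5),
]
table = build_table(null)
print(calibrated_surprise(0.25, ("exon", "low", "none"), table, min_count=2))  # 0.5
```

Returning the raw residual next to the percentile, as the first entry in this section requires, lets downstream users recalibrate against a different null set.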
Deployment¶
Primary on-device target is Apple Silicon¶
- Decided: 2026-05-20
- RFC: 0010 §3.1, §4.1
- Rationale (short): User overlap; hardware quality for this size range; signed-binary distribution maturity.
Carbon weights are not bundled; pulled from Hugging Face Hub on first run¶
- Decided: 2026-05-20
- RFC: 0010 §3.2, §4.2
- Rationale (short): Artifact size; canonical-source provenance; explicit user-initiated downloads.
Automatic updates disabled¶
- Decided: 2026-05-20
- RFC: 0010 §3.8, §4.5
- Rationale (short): Reproducibility of published results requires pinned model versions.
Runtime fails closed on network calls¶
- Decided: 2026-05-20
- RFC: 0010 §3.7
- Rationale (short): Privacy contract; silent online fallback unacceptable for personal-genome data.
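One way to picture "fails closed" is a weight resolver that raises instead of quietly reaching for the network. This is a hypothetical sketch: the function, the exception, and the `allow_download` flag are invented for illustration and do not describe the actual runtime in RFC 0010.

```python
from pathlib import Path

class OfflineError(RuntimeError):
    """Raised when a resource is missing and network access was not authorized."""

def resolve_weights(cache_dir, filename, allow_download=False, downloader=None):
    """Return the local path to the weights, downloading only on explicit opt-in."""
    path = Path(cache_dir) / filename
    if path.exists():
        return path
    if not allow_download or downloader is None:
        # Fail closed: no silent online fallback.
        raise OfflineError(
            f"{path} not found and downloads are disabled; "
            "run the explicit download step first"
        )
    return downloader(path)

try:
    resolve_weights("/nonexistent-cache-dir", "weights.bin")
except OfflineError as err:
    print("refused:", err)
```

The default is the safe path: a caller has to pass both the opt-in flag and a downloader before any network code can run.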
Artifact provenance¶
Receipts are checksum-only in v1¶
- Decided: 2026-06-01
- RFC: 0011
- Rationale (short): Manifests, input commitments, output commitments, and checksum receipts support reproducible releases without advertising unsupported trust mechanisms.
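A minimal sketch of a checksum-only receipt. SHA-256, the canonical-JSON manifest encoding, and the field names are assumptions made for illustration; RFC 0011 defines the real format.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def make_receipt(manifest: dict, inputs: bytes, outputs: bytes) -> dict:
    """Build a receipt holding only checksums of the manifest, inputs, and outputs."""
    # Canonical JSON so that key order does not change the digest.
    manifest_bytes = json.dumps(
        manifest, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return {
        "manifest_sha256": sha256_hex(manifest_bytes),
        "input_commitment_sha256": sha256_hex(inputs),
        "output_commitment_sha256": sha256_hex(outputs),
    }

def verify_receipt(receipt: dict, manifest: dict, inputs: bytes, outputs: bytes) -> bool:
    """Recompute all checksums and compare with the stored receipt."""
    return make_receipt(manifest, inputs, outputs) == receipt

manifest = {"model": "example-model", "version": "0.0.0"}  # placeholder values
receipt = make_receipt(manifest, b"input bytes", b"output bytes")
print(verify_receipt(receipt, manifest, b"input bytes", b"output bytes"))  # True
print(verify_receipt(receipt, manifest, b"input bytes", b"tampered"))      # False
```

A receipt like this shows that artifacts match what was published. It makes no claim about where or how the computation ran, which is consistent with keeping runtime assurance out of scope.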
Unsupported runtime assurance modes are out of active scope¶
- Decided: 2026-06-01
- RFC: 0011
- Rationale (short): The first paper/demo release should focus on real datasets, models, results, and terminal inference. Future runtime assurance mechanisms require a fresh RFC and implementation plan.