Design decisions log

A record of resolved design trade-offs across GenoLeWM. The RFCs contain the rationale for each decision; this document is the index — a single place to look up "what did we decide about X and where is the justification?"

When an RFC ships a new resolved decision, add an entry here. When a decision is amended, append a new entry (do not edit the old one); the history is part of the value.

Architecture

State encoder is Carbon-500M, frozen in Phase 1

Decided: 2026-05-20
RFC: 0002 §3.1, §4.2
Rationale (short): Carbon-500M is the smallest model that meets the published quality bar, fits the consumer-hardware deployment target, and unlocks single-GPU training when frozen.

State vectors are L2-normalized at the encoder

Decided: 2026-05-20
RFC: 0002 §3.5, §4.5
Rationale (short): Stable cosine + MSE loss combination; stable distance-based surprise calculation.

Window length is 12,288 bp (2,048 6-mer tokens)

Decided: 2026-05-20
RFC: 0002 §3.2, §4.4
Rationale (short): Middle ground between exon coverage and encoding cost.

Pooling is centered-mean over ±256 tokens

Decided: 2026-05-20
RFC: 0002 §3.4, §4.1
Rationale (short): Edit-local; no extra parameters; expected to outperform global mean on edit-sensitive tasks (to be confirmed by ablation).
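
A minimal sketch of centered-mean pooling, for orientation only. It assumes the edit's token index is known, that "±256" includes the center token (513 tokens in total), and that the span is clipped at the window edges. The function name `centered_mean_pool` and the clipping rule are illustrative assumptions, not taken from RFC 0002.

```python
from typing import List, Sequence


def centered_mean_pool(
    token_states: Sequence[Sequence[float]],
    center: int,
    radius: int = 256,
) -> List[float]:
    """Average token vectors in [center - radius, center + radius], clipped to the sequence."""
    if not 0 <= center < len(token_states):
        raise IndexError("center must index into token_states")
    lo = max(0, center - radius)
    hi = min(len(token_states), center + radius + 1)  # slice end is exclusive
    span = token_states[lo:hi]
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


# Toy example: 2,048 one-dimensional token states whose value equals the token index.
states = [[float(i)] for i in range(2048)]
pooled_center = centered_mean_pool(states, center=1024)  # averages tokens 768..1280
pooled_edge = centered_mean_pool(states, center=0)       # clipped to tokens 0..256
```

Because the mean is taken only around the edit, tokens far from the edit do not dilute the pooled vector, which is the "edit-local" property named in the rationale.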

Action encoder uses four sub-encoders (position, type, ref, alt)

Decided: 2026-05-20
RFC: 0003 §3.4, §4.1
Rationale (short): Inductive bias matching structure of an edit; shared ref/alt SeqMicroEncoder enforces compositional generalization.

v1 caps `len(ref)` and `len(alt)` at 16 bp

Decided: 2026-05-20
RFC: 0003 §3.1, §3.5, §4.3
Rationale (short): Covers > 95% of clinically relevant short variants; SVs require separate adapter (v2 RFC).

Predictor is cross-attention Transformer (4 cross + 2 self blocks)

Decided: 2026-05-20
RFC: 0004 §3.1, §4.1
Rationale (short): Variable-length action sequences without arch change; cross-attention exposes structured action sub-embeddings to state.

Predictor output MLP final layer is zero-initialized

Decided: 2026-05-20
RFC: 0004 §3.4, §4.3
Rationale (short): Identity-at-init; predictor starts by outputting s_t, making early training stable.
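
The identity-at-init property can be illustrated with a toy example. The sketch below assumes the predictor adds its MLP output to s_t as a residual (an assumption; the entry above only states that the predictor starts by outputting s_t). All names, shapes, and the single hidden layer are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def residual_predict(s_t: Vector, action: Vector, w_hidden: Matrix, w_out: Matrix) -> Vector:
    """s_t + W_out * relu(W_hidden * [s_t; action]). The residual form is an assumption."""
    hidden = [max(0.0, h) for h in matvec(w_hidden, s_t + action)]
    delta = matvec(w_out, hidden)
    return [s + d for s, d in zip(s_t, delta)]


rng = random.Random(0)
d_state, d_action, d_hidden = 4, 3, 8
w_hidden = [[rng.gauss(0.0, 0.1) for _ in range(d_state + d_action)] for _ in range(d_hidden)]
w_out = [[0.0] * d_hidden for _ in range(d_state)]  # zero-initialized final layer

s_t = [0.5, -0.5, 0.5, -0.5]
action = [1.0, 0.0, -1.0]
prediction = residual_predict(s_t, action, w_hidden, w_out)  # equals s_t at init
```

With the final layer at zero, the predicted delta is zero for every action, so the first gradient steps start from the "no change" prediction rather than from random noise.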

Training

Loss is `α · (1 − cos) + β · MSE / d_state`

Decided: 2026-05-20
RFC: 0005 §3.1, §4.1
Rationale (short): Cosine for direction, MSE for magnitude calibration; matches LeWM recipe ported to L2-normalized embeddings.
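
A minimal sketch of the loss for a single pair of state vectors. It assumes "MSE" in the formula denotes the squared error summed over the `d_state` dimensions, so that dividing by `d_state` yields a per-dimension mean; RFC 0005 §3.1 is authoritative on that reading. The default `alpha` and `beta` values are placeholders, not the project's settings.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def state_loss(
    pred: Sequence[float],
    target: Sequence[float],
    alpha: float = 1.0,  # placeholder weight
    beta: float = 1.0,   # placeholder weight
) -> float:
    """alpha * (1 - cos) + beta * (summed squared error) / d_state."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same dimension")
    d_state = len(target)
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return alpha * (1.0 - cosine(pred, target)) + beta * squared_error / d_state


target = l2_normalize([3.0, 4.0])
loss_same = state_loss(target, target)                   # ~0: same direction, same magnitude
loss_orthogonal = state_loss([1.0, 0.0], [0.0, 1.0])     # 1 (cosine term) + 2/2 (error term)
```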

LeJEPA regularizer is monitored-only in Phase 1

Decided: 2026-05-20
RFC: 0005 §3.2, §3.3, §4.4
Rationale (short): Frozen encoder → collapse impossible → regularizer not needed as training term; computed for monitoring to catch unexpected drift.

Optimizer is AdamW with `β₂ = 0.95`

Decided: 2026-05-20
RFC: 0005 §3.4
Rationale (short): Stability with small batches over high-dimensional latents; standard for JEPA training.

LR schedule is WSD (warmup-stable-decay)

Decided: 2026-05-20
RFC: 0005 §3.5, §4.3
Rationale (short): Phase-transition friendly; checkpoint at the end of stable phase, continue training with fresh decay schedule when LoRA is enabled.
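
For readers unfamiliar with WSD, the shape can be sketched as a pure function of the step. Linear warmup and linear decay are assumptions here (WSD variants differ in decay shape), and every step count and learning rate below is a made-up placeholder.

```python
def wsd_lr(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    stable_steps: int,
    decay_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Warmup-stable-decay learning rate with linear warmup and linear decay (assumed shapes)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    decay_step = step - warmup_steps - stable_steps
    frac = min(1.0, decay_step / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac


# Placeholder schedule: 10 warmup steps, 100 stable steps, 20 decay steps.
schedule = [wsd_lr(s, 1e-3, 10, 100, 20) for s in range(131)]
```

The stable phase is what makes a mid-run checkpoint reusable: training resumed from the last stable step can attach a fresh decay segment without having to undo an earlier one.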

Batch size 256, edit-balanced sampling

Decided: 2026-05-20
RFC: 0005 §3.7, §4.5, §4.6
Rationale (short): Matches LeWM; supports stable covariance estimation in Phase 2; per-type balance gives indels enough training signal.

Data

Reference corpus is `HuggingFaceBio/carbon-pretraining-corpus`

Decided: 2026-05-20
RFC: 0006 §3.1, §4.1
Rationale (short): In-distribution for Carbon → most reliable encoder outputs; pre-processed and tokenization-validated; public.

Edit-source mix is 40% gnomAD / 30% synthetic SNV / 20% synthetic indel / 10% ClinVar

Decided: 2026-05-20
RFC: 0006 §3.3, §4.2
Rationale (short): Balance of realism (gnomAD), action coverage (synthetic), and hard signal (ClinVar).
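
One way to realize a fixed source mix is weighted sampling of the source for each training edit. The sketch below is illustrative: the dictionary keys and the function name are hypothetical, and the actual data pipeline in RFC 0006 may draw sources differently (for example, by pre-building shards).

```python
import random
from collections import Counter
from typing import List

# Source weights from the decision above; key names are hypothetical.
EDIT_SOURCE_MIX = {
    "gnomad": 0.40,
    "synthetic_snv": 0.30,
    "synthetic_indel": 0.20,
    "clinvar": 0.10,
}


def sample_edit_sources(n: int, rng: random.Random) -> List[str]:
    """Draw n edit sources independently according to EDIT_SOURCE_MIX."""
    sources = list(EDIT_SOURCE_MIX)
    weights = [EDIT_SOURCE_MIX[s] for s in sources]
    return rng.choices(sources, weights=weights, k=n)


rng = random.Random(0)
counts = Counter(sample_edit_sources(10_000, rng))
fractions = {source: counts[source] / 10_000 for source in EDIT_SOURCE_MIX}
```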

Windows overlap at 67% (stride 4,096 bp)

Decided: 2026-05-20
RFC: 0006 §3.2, §4.3
Rationale (short): Each position covered by ~3 windows; gives predictor multiple contexts per variant.
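
The overlap arithmetic can be checked directly. Assuming evenly strided 12,288 bp windows, a stride of one third of the window (4,096 bp) gives an overlap of two thirds (about 67%) and three windows per interior position. The helper names below are illustrative.

```python
from typing import List

WINDOW_BP = 12_288  # 2,048 tokens x 6 bp per 6-mer token


def overlap_fraction(window_bp: int, stride_bp: int) -> float:
    """Fraction of a window shared with the next window."""
    return (window_bp - stride_bp) / window_bp


def window_starts(region_len: int, window_bp: int, stride_bp: int) -> List[int]:
    """Start coordinates of all full windows that fit in a region."""
    return list(range(0, region_len - window_bp + 1, stride_bp))


def coverage_at(pos: int, starts: List[int], window_bp: int) -> int:
    """Number of windows that contain position pos."""
    return sum(1 for s in starts if s <= pos < s + window_bp)


stride_bp = WINDOW_BP // 3                      # 4,096 bp
starts = window_starts(4 * WINDOW_BP, WINDOW_BP, stride_bp)
interior_coverage = coverage_at(20_000, starts, WINDOW_BP)
overlap = overlap_fraction(WINDOW_BP, stride_bp)
```

Positions near the ends of a region are covered by fewer windows, which is why the entry says "~3" rather than exactly 3.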

Three holdouts: `holdout-chr` (chr21), `holdout-clinvar`, `holdout-haplotypes`

Decided: 2026-05-20
RFC: 0006 §3.8, §4.5
Rationale (short): Clean spatial generalization (entire chromosome); clean known-pathogenic generalization (ClinVar P/LP); clean multi-edit generalization (gnomAD haplotypes).

Evaluation

VEP benchmarks mirror Carbon's published suite

Decided: 2026-05-20
RFC: 0007 §3.1, §4.1
Rationale (short): Direct comparability with Carbon's model card numbers.

Two scoring heads reported: surprise and displacement

Decided: 2026-05-20
RFC: 0007 §3.1.2, §4.2
Rationale (short): Different uses → different signals; prevents optimizing one at the cost of the other.

Rollout fidelity reported per-K with a naive baseline

Decided: 2026-05-20
RFC: 0007 §3.2, §4.3
Rationale (short): Catches degenerate predictors that output s_t regardless of action.
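
To see why the naive baseline matters, consider a toy per-K report. The sketch assumes cosine similarity as the fidelity metric and an identity baseline that always predicts the starting state; both are illustrative assumptions, and RFC 0007 §3.2 defines the actual metric. All data is made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def fidelity_per_k(
    start_state: Sequence[float],
    predicted: Sequence[Sequence[float]],
    actual: Sequence[Sequence[float]],
) -> List[Tuple[int, float, float]]:
    """Return (K, model_cos, naive_cos) for K = 1..len(actual).

    predicted[K-1] is the rolled-out state after K edits, actual[K-1] the reference
    state after the same K edits. The naive baseline always predicts start_state.
    """
    rows = []
    for k, (pred, true) in enumerate(zip(predicted, actual), start=1):
        rows.append((k, cosine(pred, true), cosine(start_state, true)))
    return rows


start = [1.0, 0.0]
actual = [[0.8, 0.6], [0.0, 1.0]]  # made-up reference states after 1 and 2 edits
good_rows = fidelity_per_k(start, predicted=actual, actual=actual)
degenerate_rows = fidelity_per_k(start, predicted=[start, start], actual=actual)
```

A predictor that ignores its actions scores exactly the same as the baseline at every K, so reporting the two columns side by side makes that failure visible even when the raw fidelity number looks high at small K.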

Efficiency benchmarks include Apple M3 Max

Decided: 2026-05-20
RFC: 0007 §3.3, §4.4
Rationale (short): Freedom-tech / personal-genome target audience skews Mac; benchmarking it as a first-class target keeps Phase 3 efficiency claims honest.

Planning

Default solver is CEM

Decided: 2026-05-20
RFC: 0008 §3.4, §4.1
Rationale (short): Discrete edit space; no per-task training; fast enough on H100 to amortize per query.
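
A minimal sketch of the cross-entropy method (CEM) over a discrete action space, to make the "no per-task training" point concrete. Each plan is a list of integer indices into a hypothetical table of candidate edits, and `toy_cost` stands in for scoring a plan with predictor rollouts. The population size, elite count, and smoothing factor are placeholders, not the values from RFC 0008.

```python
import random
from typing import Callable, List, Tuple

Plan = List[int]


def cem_discrete(
    cost_fn: Callable[[Plan], float],
    n_actions: int,
    horizon: int,
    rng: random.Random,
    iterations: int = 20,
    population: int = 64,
    n_elite: int = 8,
    smoothing: float = 0.7,
) -> Tuple[Plan, float]:
    """Minimize cost_fn over plans using one categorical distribution per step."""
    probs = [[1.0 / n_actions] * n_actions for _ in range(horizon)]
    best_plan: Plan = []
    best_cost = float("inf")
    actions = range(n_actions)
    for _ in range(iterations):
        # Sample a population of plans from the current per-step distributions.
        plans = [
            [rng.choices(actions, weights=probs[t])[0] for t in range(horizon)]
            for _ in range(population)
        ]
        plans.sort(key=cost_fn)
        elites = plans[:n_elite]
        elite_cost = cost_fn(elites[0])
        if elite_cost < best_cost:
            best_plan, best_cost = list(elites[0]), elite_cost
        # Move each step's distribution toward the elite action frequencies.
        for t in range(horizon):
            for a in actions:
                freq = sum(1 for plan in elites if plan[t] == a) / n_elite
                probs[t][a] = smoothing * freq + (1.0 - smoothing) * probs[t][a]
    return best_plan, best_cost


TARGET = [2, 0, 3, 1]  # hidden optimum for the toy problem


def toy_cost(plan: Plan) -> float:
    """Hamming distance to TARGET; a stand-in for a predictor-based rollout cost."""
    return float(sum(1 for p, t in zip(plan, TARGET) if p != t))


best_plan, best_cost = cem_discrete(toy_cost, n_actions=4, horizon=4, rng=random.Random(0))
```

The only model-dependent piece is `cost_fn`, so the same loop serves any planning objective that can be scored from predicted states.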

Planning never calls Carbon during search

Decided: 2026-05-20
RFC: 0008 §2
Rationale (short): Efficiency thesis of the world-model framing; pay for Carbon once, run thousands of CEM rollouts at predictor cost.

Surprise

Calibrated surprise is the published score; raw residual is also exposed

Decided: 2026-05-20
RFC: 0009 §3.5, §3.7, §4.4
Rationale (short): Context-aware percentile is interpretable; raw exposed for debugging and recalibration.

Calibration distribution is gnomAD common variants (AF ≥ 1%)

Decided: 2026-05-20
RFC: 0009 §3.4, §4.2
Rationale (short): Biology's tolerated background; appropriate null model.
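
Calibration against a null distribution can be sketched as an empirical percentile. The example assumes the calibrated score is the percentage of null (common-variant) residuals at or below the raw residual; the exact definition lives in RFC 0009, and the function name and residual values here are made up.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def calibrated_surprise(raw_residual: float, null_sorted: Sequence[float]) -> float:
    """Percentile (0-100) of raw_residual within a sorted null distribution."""
    if not null_sorted:
        raise ValueError("null distribution is empty")
    rank = bisect_right(null_sorted, raw_residual)
    return 100.0 * rank / len(null_sorted)


def score(raw_residual: float, null_sorted: Sequence[float]) -> Dict[str, float]:
    """Expose both the raw residual and its calibrated percentile."""
    return {"raw": raw_residual, "calibrated": calibrated_surprise(raw_residual, null_sorted)}


# Made-up residuals standing in for gnomAD common variants (AF >= 1%).
null_residuals = sorted([0.4, 0.1, 0.3, 0.2])
typical = score(0.25, null_residuals)
extreme = score(0.90, null_residuals)
```

Keeping the raw residual alongside the percentile means the score can be recalibrated later against a different null set without re-running the model.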

Calibration buckets by `(region_class, gc_bin, repeat_class)` with back-off

Decided: 2026-05-20
RFC: 0009 §3.3, §4.3
Rationale (short): Standard pattern (matches CADD); back-off handles sparse buckets gracefully.
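
Bucket lookup with back-off can be sketched as a walk from the most specific key to a global fallback. The back-off order (drop `repeat_class`, then `gc_bin`, then everything), the `min_count` threshold, and the bucket labels are all assumptions for illustration; RFC 0009 §3.3 defines the real rules.

```python
from typing import Dict, List, Optional, Tuple

Bucket = Tuple[Optional[str], Optional[str], Optional[str]]


def backoff_chain(region_class: str, gc_bin: str, repeat_class: str) -> List[Bucket]:
    """Most specific bucket first, global bucket last (assumed order)."""
    return [
        (region_class, gc_bin, repeat_class),
        (region_class, gc_bin, None),
        (region_class, None, None),
        (None, None, None),
    ]


def select_bucket(
    tables: Dict[Bucket, List[float]],
    region_class: str,
    gc_bin: str,
    repeat_class: str,
    min_count: int = 1000,  # placeholder threshold
) -> Tuple[Bucket, List[float]]:
    """Return the first bucket in the chain holding at least min_count null residuals."""
    for key in backoff_chain(region_class, gc_bin, repeat_class):
        samples = tables.get(key)
        if samples is not None and len(samples) >= min_count:
            return key, samples
    raise KeyError("no calibration bucket has enough samples, including the global bucket")


# Toy tables with hypothetical labels; real buckets would hold many more residuals.
tables: Dict[Bucket, List[float]] = {
    ("exon", "gc_high", "none"): [0.1],
    ("exon", "gc_high", None): [0.1, 0.2],
    (None, None, None): [0.1, 0.2, 0.3],
}
sparse_key, _ = select_bucket(tables, "exon", "gc_high", "none", min_count=2)
unseen_key, _ = select_bucket(tables, "intron", "gc_low", "LINE", min_count=2)
```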

Deployment

Primary on-device target is Apple Silicon

Decided: 2026-05-20
RFC: 0010 §3.1, §4.1
Rationale (short): User overlap; hardware quality for this size range; signed-binary distribution maturity.

Carbon weights are not bundled; pulled from Hugging Face Hub on first run

Decided: 2026-05-20
RFC: 0010 §3.2, §4.2
Rationale (short): Artifact size; canonical-source provenance; explicit user-initiated downloads.

Automatic updates disabled

Decided: 2026-05-20
RFC: 0010 §3.8, §4.5
Rationale (short): Reproducibility of published results requires pinned model versions.

Runtime fails closed on network calls

Decided: 2026-05-20
RFC: 0010 §3.7
Rationale (short): Privacy contract; silent online fallback unacceptable for personal-genome data.

Artifact provenance

Receipts are checksum-only in v1

Decided: 2026-06-01
RFC: 0011
Rationale (short): Manifests, input commitments, output commitments, and checksum receipts support reproducible releases without advertising unsupported trust mechanisms.
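
A checksum-only receipt can be as small as three digests. The sketch below assumes SHA-256 and a canonical JSON encoding of the manifest; the hash choice, the field names, and the example manifest are all hypothetical, and RFC 0011 defines the real receipt format.

```python
import hashlib
import json
from typing import Any, Dict


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def make_receipt(manifest: Dict[str, Any], inputs: bytes, outputs: bytes) -> Dict[str, str]:
    """Commit to the manifest, inputs, and outputs by checksum only."""
    # Sorted keys and fixed separators make the manifest encoding deterministic.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "manifest_sha256": sha256_hex(canonical),
        "input_sha256": sha256_hex(inputs),
        "output_sha256": sha256_hex(outputs),
    }


def verify_receipt(
    receipt: Dict[str, str], manifest: Dict[str, Any], inputs: bytes, outputs: bytes
) -> bool:
    """Recompute the checksums and compare them with the stored receipt."""
    return receipt == make_receipt(manifest, inputs, outputs)


manifest = {"model": "example-model", "version": "0.0.0"}  # hypothetical manifest
receipt = make_receipt(manifest, inputs=b"ACGT", outputs=b"0.42")
```

A receipt like this shows that a result was produced from specific bytes, and nothing more; it makes no claim about who ran the computation or where.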

Unsupported runtime assurance modes are out of active scope

Decided: 2026-06-01
RFC: 0011
Rationale (short): The first paper/demo release should focus on real datasets, models, results, and terminal inference. Future runtime assurance mechanisms require a fresh RFC and implementation plan.