Design decisions log

A record of resolved design trade-offs across GenoLeWM. The RFCs contain the rationale for each decision; this document is the index — a single place to look up "what did we decide about X and where is the justification?"

When an RFC ships a new resolved decision, add an entry here. When a decision is amended, append a new entry (do not edit the old one); the history is part of the value.


Architecture

State encoder is Carbon-500M, frozen in Phase 1

  • Decided: 2026-05-20
  • RFC: 0002 §3.1, §4.2
  • Rationale (short): Carbon-500M is the smallest model that meets the published quality bar, fits the consumer-hardware deployment target, and unlocks single-GPU training when frozen.

State vectors are L2-normalized at the encoder

  • Decided: 2026-05-20
  • RFC: 0002 §3.5, §4.5
  • Rationale (short): Stable cosine + MSE loss combination; stable distance-based surprise calculation.
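
One way to see why normalization keeps the two loss terms and the surprise distance well behaved: for unit vectors, squared Euclidean distance equals 2(1 − cos), so both quantities move together. A minimal pure-Python sketch of that identity follows; the function names are illustrative and are not taken from RFC 0002.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two toy 2-d "state vectors"; real states are much higher-dimensional.
s_t = l2_normalize([3.0, 4.0])
s_next = l2_normalize([4.0, 3.0])

cos = dot(s_t, s_next)                 # equals cosine similarity for unit vectors
dist_sq = squared_distance(s_t, s_next)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert abs(dist_sq - (2.0 - 2.0 * cos)) < 1e-12
```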

Window length is 12,288 bp (2,048 6-mer tokens)

  • Decided: 2026-05-20
  • RFC: 0002 §3.2, §4.4
  • Rationale (short): Middle ground between exon coverage and encoding cost.

Pooling is centered-mean over ± 256 tokens

  • Decided: 2026-05-20
  • RFC: 0002 §3.4, §4.1
  • Rationale (short): Edit-local; no extra parameters; expected to outperform global mean on edit-sensitive tasks (to be confirmed by ablation).
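
Centered-mean pooling can be sketched in a few lines. This version assumes the pool is centered on a given token index, that the ± 256 bounds are inclusive, and that the span is clipped at the window edges; those details, and the name centered_mean_pool, are assumptions rather than the RFC's definition.

```python
def centered_mean_pool(token_states, center, radius=256):
    """Average per-token vectors over [center - radius, center + radius].

    token_states: list of equal-length float lists, one per token.
    The span is clipped to the sequence bounds (an assumed edge policy).
    """
    lo = max(0, center - radius)
    hi = min(len(token_states), center + radius + 1)
    span = token_states[lo:hi]
    dim = len(span[0])
    return [sum(tok[j] for tok in span) / len(span) for j in range(dim)]

# Toy example: 2,048 tokens with 2-d states, pooled around the window center.
tokens = [[float(i), 1.0] for i in range(2048)]
pooled = centered_mean_pool(tokens, center=1024)
```

No parameters are learned here, which is the "no extra parameters" point in the rationale: the only choices are the center and the radius.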

Action encoder uses four sub-encoders (position, type, ref, alt)

  • Decided: 2026-05-20
  • RFC: 0003 §3.4, §4.1
  • Rationale (short): Inductive bias matching structure of an edit; shared ref/alt SeqMicroEncoder enforces compositional generalization.

v1 caps len(ref) and len(alt) at 16 bp

  • Decided: 2026-05-20
  • RFC: 0003 §3.1, §3.5, §4.3
  • Rationale (short): Covers > 95% of clinically relevant short variants; SVs require separate adapter (v2 RFC).
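
The four-part action structure and the 16 bp cap can be expressed as a small validated record. This is a sketch only: the class name, the 0-based position convention, the edit-type labels, and the allowed alphabet are assumptions, not details confirmed by RFC 0003.

```python
from dataclasses import dataclass

MAX_ALLELE_LEN = 16          # v1 cap on len(ref) and len(alt), in bp
VALID_BASES = set("ACGTN")   # assumed alphabet

@dataclass(frozen=True)
class Edit:
    position: int    # offset within the window (0-based is an assumption)
    edit_type: str   # e.g. "SNV", "insertion", "deletion" (hypothetical labels)
    ref: str
    alt: str

    def __post_init__(self):
        # Reject alleles that exceed the v1 cap or use unknown characters.
        for name, allele in (("ref", self.ref), ("alt", self.alt)):
            if len(allele) > MAX_ALLELE_LEN:
                raise ValueError(
                    f"{name} is {len(allele)} bp; v1 supports at most {MAX_ALLELE_LEN} bp"
                )
            if not set(allele) <= VALID_BASES:
                raise ValueError(f"{name} contains characters outside {sorted(VALID_BASES)}")

snv = Edit(position=1024, edit_type="SNV", ref="A", alt="G")
```

Rejecting over-length alleles at construction time keeps structural variants out of the v1 code path instead of silently truncating them.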

Predictor is cross-attention Transformer (4 cross + 2 self blocks)

  • Decided: 2026-05-20
  • RFC: 0004 §3.1, §4.1
  • Rationale (short): Variable-length action sequences without arch change; cross-attention exposes structured action sub-embeddings to state.

Predictor output MLP final layer is zero-initialized

  • Decided: 2026-05-20
  • RFC: 0004 §3.4, §4.3
  • Rationale (short): Identity-at-init; predictor starts by outputting s_t, making early training stable.
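
Identity-at-init follows if the output MLP feeds a residual connection, so that a zero final layer contributes nothing to the prediction. The sketch below assumes that residual form and, for brevity, feeds s_t directly into a tiny two-layer MLP; in the real predictor the MLP input would come from the Transformer blocks. All names and sizes are illustrative.

```python
import random

def linear(x, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_output_head(d_state, d_hidden, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(d_state)] for _ in range(d_hidden)]
    b1 = [0.0] * d_hidden
    w2 = [[0.0] * d_hidden for _ in range(d_state)]  # final layer starts at zero
    b2 = [0.0] * d_state
    return w1, b1, w2, b2

def predict_next_state(s_t, head):
    w1, b1, w2, b2 = head
    hidden = [max(0.0, h) for h in linear(s_t, w1, b1)]  # ReLU (assumed activation)
    delta = linear(hidden, w2, b2)                       # all zeros at init
    return [s + d for s, d in zip(s_t, delta)]           # residual: s_t + delta

head = init_output_head(d_state=2, d_hidden=4)
s_t = [0.6, 0.8]
assert predict_next_state(s_t, head) == s_t  # the predictor starts as the identity
```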

Training

Loss is α · (1 − cos) + β · MSE / d_state

  • Decided: 2026-05-20
  • RFC: 0005 §3.1, §4.1
  • Rationale (short): Cosine for direction, MSE for magnitude calibration; matches LeWM recipe ported to L2-normalized embeddings.
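
As a sketch, the loss can be written out directly. Two assumptions are made here and are not confirmed by RFC 0005: "MSE / d_state" is read as the summed squared error divided by the state dimension, and the default weights alpha = beta = 1.0 are placeholders.

```python
import math

def prediction_loss(pred, target, alpha=1.0, beta=1.0, eps=1e-12):
    """alpha * (1 - cos(pred, target)) + beta * squared_error / d_state."""
    d_state = len(pred)
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos = dot / max(norm_p * norm_t, eps)                       # direction term
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))    # magnitude term
    return alpha * (1.0 - cos) + beta * sq_err / d_state

# Orthogonal unit vectors: cos = 0 and squared error = 2, so with d_state = 2
# the loss is 1.0 * (1 - 0) + 1.0 * 2 / 2 = 2.0.
loss = prediction_loss([1.0, 0.0], [0.0, 1.0])
```

The cosine term ignores vector length, so the MSE term is what penalizes a prediction that points the right way but has drifted off the unit sphere.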

LeJEPA regularizer is monitored-only in Phase 1

  • Decided: 2026-05-20
  • RFC: 0005 §3.2, §3.3, §4.4
  • Rationale (short): Frozen encoder → collapse impossible → regularizer not needed as training term; computed for monitoring to catch unexpected drift.

Optimizer is AdamW with β₂ = 0.95

  • Decided: 2026-05-20
  • RFC: 0005 §3.4
  • Rationale (short): Stability with small batches over high-dimensional latents; standard for JEPA training.

LR schedule is WSD (warmup-stable-decay)

  • Decided: 2026-05-20
  • RFC: 0005 §3.5, §4.3
  • Rationale (short): Phase-transition friendly; checkpoint at the end of stable phase, continue training with fresh decay schedule when LoRA is enabled.
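
A warmup-stable-decay schedule is a piecewise function of the step count. The sketch below assumes linear warmup and linear decay; the decay shape, the step counts, and the function name are illustrative choices, not values from RFC 0005.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at a given step under warmup-stable-decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold peak_lr; a checkpoint taken at the end of this phase
        # can later be resumed with a fresh decay phase.
        return peak_lr
    # Decay: interpolate linearly from peak_lr to min_lr, then hold min_lr.
    progress = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress

schedule = [wsd_lr(s, 1e-3, warmup_steps=10, stable_steps=100, decay_steps=50)
            for s in range(160)]
```

Because the stable phase has no dependence on total training length, training can be extended from the stable-phase checkpoint without re-deriving the schedule.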

Batch size 256, edit-balanced sampling

  • Decided: 2026-05-20
  • RFC: 0005 §3.7, §4.5, §4.6
  • Rationale (short): Matches LeWM; supports stable covariance estimation in Phase 2; per-type balance gives indels enough training signal.
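
Per-type balancing can be sketched as a round-robin over edit types, so that rare types fill as many batch slots as common ones. The example record layout (a dict with a "type" key) and the type labels are hypothetical, and sampling with replacement is an assumption.

```python
import random
from collections import defaultdict

def edit_balanced_batch(examples, batch_size=256, seed=0):
    """Draw a batch with near-equal counts of each edit type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["type"]].append(ex)
    types = sorted(by_type)
    batch = []
    for i in range(batch_size):
        edit_type = types[i % len(types)]             # cycle through the types
        batch.append(rng.choice(by_type[edit_type]))  # sample with replacement
    rng.shuffle(batch)
    return batch

# A pool dominated by SNVs still yields a batch that is about one third indels
# of each kind.
pool = ([{"type": "SNV"}] * 900 + [{"type": "insertion"}] * 50
        + [{"type": "deletion"}] * 50)
batch = edit_balanced_batch(pool)
```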

Data

Reference corpus is HuggingFaceBio/carbon-pretraining-corpus

  • Decided: 2026-05-20
  • RFC: 0006 §3.1, §4.1
  • Rationale (short): In-distribution for Carbon → most reliable encoder outputs; pre-processed and tokenization-validated; public.

Edit-source mix is 40 gnomAD / 30 synthetic SNV / 20 synthetic indel / 10 ClinVar

  • Decided: 2026-05-20
  • RFC: 0006 §3.3, §4.2
  • Rationale (short): Balance of realism (gnomAD), action coverage (synthetic), and hard signal (ClinVar).
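
The mix can be read as sampling weights over edit sources. A minimal sketch, assuming each training example's source is drawn independently; the dictionary keys are illustrative names, not identifiers from RFC 0006.

```python
import random

# Percentages from the decision above.
EDIT_SOURCE_MIX = {
    "gnomad": 40,
    "synthetic_snv": 30,
    "synthetic_indel": 20,
    "clinvar": 10,
}

def sample_sources(n, mix=EDIT_SOURCE_MIX, seed=0):
    """Draw n source labels in proportion to the configured mix."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=n)

draws = sample_sources(10_000)
```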

Windows overlap at 67% (stride 4,096 bp)

  • Decided: 2026-05-20
  • RFC: 0006 §3.2, §4.3
  • Rationale (short): Each position covered by ~3 windows; gives predictor multiple contexts per variant.
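
The relation between window length, stride, overlap, and coverage is simple arithmetic. For a 12,288 bp window, a 67% overlap and roughly threefold coverage correspond to windows that start every 4,096 bp and share 8,192 bp with their neighbor. The helper names below are illustrative, and the sketch assumes fixed-length windows tiled at a constant stride.

```python
def tiling_stats(window_bp, stride_bp):
    """Overlap and steady-state coverage for fixed-size tiled windows."""
    overlap_bp = window_bp - stride_bp
    return {
        "overlap_bp": overlap_bp,
        "overlap_fraction": overlap_bp / window_bp,
        "coverage": window_bp / stride_bp,   # windows per interior position
    }

def window_starts(region_len_bp, window_bp, stride_bp):
    """Start offsets of all windows that fit entirely inside the region."""
    return list(range(0, region_len_bp - window_bp + 1, stride_bp))

def windows_covering(position, starts, window_bp):
    return [s for s in starts if s <= position < s + window_bp]

stats = tiling_stats(window_bp=12_288, stride_bp=4_096)
starts = window_starts(region_len_bp=49_152, window_bp=12_288, stride_bp=4_096)
covering = windows_covering(20_000, starts, window_bp=12_288)
```

Positions near the ends of a region are covered by fewer windows, so the threefold figure holds only for interior positions.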

Three holdouts: holdout-chr (chr21), holdout-clinvar, holdout-haplotypes

  • Decided: 2026-05-20
  • RFC: 0006 §3.8, §4.5
  • Rationale (short): Clean spatial generalization (entire chromosome); clean known-pathogenic generalization (ClinVar P/LP); clean multi-edit generalization (gnomAD haplotypes).

Evaluation

VEP benchmarks mirror Carbon's published suite

  • Decided: 2026-05-20
  • RFC: 0007 §3.1, §4.1
  • Rationale (short): Direct comparability with Carbon's model card numbers.

Two scoring heads reported: surprise and displacement

  • Decided: 2026-05-20
  • RFC: 0007 §3.1.2, §4.2
  • Rationale (short): Different uses → different signals; prevents optimizing one at the cost of the other.

Rollout fidelity reported per-K with a naive baseline

  • Decided: 2026-05-20
  • RFC: 0007 §3.2, §4.3
  • Rationale (short): Catches degenerate predictors that output s_t regardless of action.
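
The comparison against a naive baseline can be sketched as follows. This version assumes the baseline is "copy the starting state forward" and uses cosine distance as the error metric; both are assumptions, and RFC 0007 may define them differently.

```python
import math

def cosine_distance(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)

def rollout_report(true_states, predicted_states):
    """Per-K error of the model versus a copy-the-start-state baseline.

    true_states[k] is the encoded state after k edits (k = 0 is the start);
    predicted_states[k] is the predictor's state after k rollout steps.
    """
    s0 = true_states[0]
    rows = []
    for k in range(1, len(true_states)):
        model_err = cosine_distance(predicted_states[k], true_states[k])
        naive_err = cosine_distance(s0, true_states[k])
        rows.append({"K": k, "model": model_err, "naive": naive_err,
                     "beats_naive": model_err < naive_err})
    return rows

# A degenerate predictor that ignores the action never beats the baseline.
truth = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
degenerate = [truth[0]] * len(truth)
report = rollout_report(truth, degenerate)
```

Reporting the two columns side by side matters because small edits move the state very little, so a predictor that ignores the action can still post a low absolute error.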

Efficiency benchmarks include Apple M3 Max

  • Decided: 2026-05-20
  • RFC: 0007 §3.3, §4.4
  • Rationale (short): Freedom-tech / personal-genome target audience skews Mac; first-class target for Phase 3 honesty.

Planning

Default solver is CEM

  • Decided: 2026-05-20
  • RFC: 0008 §3.4, §4.1
  • Rationale (short): Discrete edit space; no per-task training; fast enough on H100 to amortize per query.
  • Decided: 2026-05-20
  • RFC: 0008 §2
  • Rationale (short): Efficiency thesis of the world-model framing; pay for Carbon once, run thousands of CEM rollouts at predictor cost.
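
For a discrete edit space, CEM keeps one categorical distribution per decision slot, samples candidate plans, and refits the distributions to the best-scoring samples. The sketch below is a generic categorical CEM, not the solver specified in RFC 0008; score_fn stands in for a predictor rollout score, and every hyperparameter is a placeholder.

```python
import random

def cem_discrete(score_fn, n_slots, n_choices, iters=20, pop=64,
                 elite_frac=0.125, smoothing=0.7, seed=0):
    """Maximize score_fn over sequences of n_slots categorical choices."""
    rng = random.Random(seed)
    probs = [[1.0 / n_choices] * n_choices for _ in range(n_slots)]
    n_elite = max(1, int(pop * elite_frac))
    best, best_score = None, float("-inf")
    for _ in range(iters):
        # Sample a population of candidate plans from the current distribution.
        samples = [[rng.choices(range(n_choices), weights=p)[0] for p in probs]
                   for _ in range(pop)]
        scored = sorted(((score_fn(s), s) for s in samples),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        elites = [s for _, s in scored[:n_elite]]
        # Refit each slot to elite frequencies, smoothed toward the old values.
        for i in range(n_slots):
            counts = [0] * n_choices
            for plan in elites:
                counts[plan[i]] += 1
            probs[i] = [smoothing * counts[c] / n_elite + (1 - smoothing) * probs[i][c]
                        for c in range(n_choices)]
    return best, best_score

# Toy objective: count slots that match a hidden target plan.
target = [2, 0, 3, 1]
plan, score = cem_discrete(lambda s: sum(a == b for a, b in zip(s, target)),
                           n_slots=4, n_choices=4)
```

Each score_fn call corresponds to a predictor rollout, which is where the "pay for Carbon once" argument applies: the encoder is not invoked inside the loop.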

Surprise

Calibrated surprise is the published score; raw residual is also exposed

  • Decided: 2026-05-20
  • RFC: 0009 §3.5, §3.7, §4.4
  • Rationale (short): Context-aware percentile is interpretable; raw exposed for debugging and recalibration.

Calibration distribution is gnomAD common variants (AF ≥ 1%)

  • Decided: 2026-05-20
  • RFC: 0009 §3.4, §4.2
  • Rationale (short): Biology's tolerated background; appropriate null model.

Calibration buckets by (region_class, gc_bin, repeat_class) with back-off

  • Decided: 2026-05-20
  • RFC: 0009 §3.3, §4.3
  • Rationale (short): Standard pattern (matches CADD); back-off handles sparse buckets gracefully.
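
Bucketed calibration with back-off can be sketched as a lookup that falls back to coarser buckets until one has enough null samples, followed by a percentile rank against that bucket. The back-off order, the minimum sample count, and the table layout below are assumptions, not details from RFC 0009.

```python
import bisect

def find_bucket(tables, region_class, gc_bin, repeat_class, min_count=100):
    """Return the most specific bucket with at least min_count null samples.

    tables maps (region_class, gc_bin, repeat_class) keys, with None as a
    wildcard, to sorted lists of raw residuals from the null distribution.
    """
    candidates = [
        (region_class, gc_bin, repeat_class),
        (region_class, gc_bin, None),
        (region_class, None, None),
        (None, None, None),            # global fallback
    ]
    for key in candidates:
        null = tables.get(key)
        if null is not None and len(null) >= min_count:
            return key, null
    raise KeyError("no calibration bucket has enough samples")

def calibrated_surprise(raw, tables, region_class, gc_bin, repeat_class,
                        min_count=100):
    """Percentile (0-100) of raw within the selected null bucket."""
    key, null = find_bucket(tables, region_class, gc_bin, repeat_class, min_count)
    rank = bisect.bisect_right(null, raw)
    return 100.0 * rank / len(null), key

tables = {
    ("exon", "gc_mid", "none"): [0.1, 0.2],                  # too sparse
    ("exon", None, None): [i / 200 for i in range(200)],
    (None, None, None): [i / 1000 for i in range(1000)],
}
percentile, used = calibrated_surprise(0.5, tables, "exon", "gc_mid", "none")
```

Returning the bucket key alongside the percentile makes it visible when a score came from a backed-off bucket, which helps when the raw residual is used for recalibration.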

Deployment

Primary on-device target is Apple Silicon

  • Decided: 2026-05-20
  • RFC: 0010 §3.1, §4.1
  • Rationale (short): User overlap; hardware quality for this size range; signed-binary distribution maturity.

Carbon weights are not bundled; pulled from Hugging Face Hub on first run

  • Decided: 2026-05-20
  • RFC: 0010 §3.2, §4.2
  • Rationale (short): Artifact size; canonical-source provenance; explicit user-initiated downloads.

Automatic updates disabled

  • Decided: 2026-05-20
  • RFC: 0010 §3.8, §4.5
  • Rationale (short): Reproducibility of published results requires pinned model versions.

Runtime fails closed on network calls

  • Decided: 2026-05-20
  • RFC: 0010 §3.7
  • Rationale (short): Privacy contract; silent online fallback unacceptable for personal-genome data.
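
One way to make "fail closed" concrete is to turn any socket creation during inference into a hard error. The context manager below is an illustration of that idea, not the mechanism described in RFC 0010; it only covers Python code that creates sockets through the socket module, and native extensions could bypass it.

```python
import socket
from contextlib import contextmanager

class NetworkBlockedError(RuntimeError):
    """Raised when code tries to open a socket while the network is disabled."""

@contextmanager
def network_disabled():
    original = socket.socket

    def blocked(*args, **kwargs):
        raise NetworkBlockedError("network access is disabled during inference")

    socket.socket = blocked          # any new socket now raises
    try:
        yield
    finally:
        socket.socket = original     # always restore the real constructor

def run_offline(inference_fn, *args, **kwargs):
    """Run inference_fn with socket creation blocked (hypothetical wrapper)."""
    with network_disabled():
        return inference_fn(*args, **kwargs)
```

The error is raised rather than logged and ignored, so a missing local file cannot quietly become a network request.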

Artifact provenance

Receipts are checksum-only in v1

  • Decided: 2026-06-01
  • RFC: 0011
  • Rationale (short): Manifests, input commitments, output commitments, and checksum receipts support reproducible releases without advertising unsupported trust mechanisms.
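
A checksum-only receipt can be as small as a set of SHA-256 digests plus a digest over the canonical receipt body. The field names and JSON layout below are hypothetical; RFC 0011 defines the actual format.

```python
import hashlib
import json

def _commit(named_blobs):
    """Map each name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(blob).hexdigest()
            for name, blob in sorted(named_blobs.items())}

def _canonical(body):
    # Sorted keys and fixed separators give a stable byte encoding.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")

def make_receipt(manifest, inputs, outputs):
    body = {
        "manifest": manifest,
        "input_commitments": _commit(inputs),
        "output_commitments": _commit(outputs),
    }
    return {**body, "receipt_sha256": hashlib.sha256(_canonical(body)).hexdigest()}

def verify_receipt(receipt):
    """Recompute the receipt digest; detects edits, does not prove authorship."""
    body = {k: v for k, v in receipt.items() if k != "receipt_sha256"}
    return hashlib.sha256(_canonical(body)).hexdigest() == receipt["receipt_sha256"]

receipt = make_receipt({"model": "demo"}, {"in.fa": b"ACGT"}, {"out.json": b"{}"})
```

A receipt like this shows that the listed files match their digests. It carries no signature, which matches the stated goal of not advertising trust mechanisms that v1 does not support.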

Unsupported runtime assurance modes are out of active scope

  • Decided: 2026-06-01
  • RFC: 0011
  • Rationale (short): The first paper/demo release should focus on real datasets, models, results, and terminal inference. Future runtime assurance mechanisms require a fresh RFC and implementation plan.