00 - Overview

Status: Authoritative for v0.1
Companion RFC: RFC-0001
Last reviewed: 2026-06-06

Thesis

GenoLeWM is an action-conditioned Joint-Embedding Predictive Architecture (JEPA) over DNA. A pretrained DNA foundation model (Carbon-500M by default, frozen) supplies a state vector for a contiguous genomic window. A small trainable predictor head, conditioned on a structured genomic edit, predicts the post-edit state in the same latent space. Variant scoring, multi-edit rollout, planning, and unsupervised pathogenicity scoring all reduce to operations over that predictor.

The architectural contract is one sentence:

ŝ_{t+1} = g(s_t, a) where s_t = enc(w_ref), a = action(EditSpec), and the encoder enc is frozen.

Every other system commitment is downstream of that single equation.
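The contract can be made concrete with a toy sketch. Everything below is an illustrative assumption, not the project's API: the `EditSpec` fields, the hash-based stand-ins for `enc` and `action`, the latent width `DIM`, and the residual form of `g` are all hypothetical. It shows only the shape of the data flow, in which a frozen encoder produces `s_t`, an edit becomes an action vector, and the predictor returns a post-edit state in the same latent space.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 8  # toy latent width (assumption; the real encoder width is not stated here)


@dataclass(frozen=True)
class EditSpec:
    """Hypothetical structured edit: offset in the window, ref allele, alt allele."""
    pos: int
    ref: str
    alt: str


def _unit_vector(seed: str) -> list[float]:
    """Deterministic pseudo-embedding of unit length derived from a SHA-256 digest."""
    digest = hashlib.sha256(seed.encode()).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def enc(window: str) -> list[float]:
    """Stand-in for the frozen encoder: s_t = enc(w_ref). It has no trainable state."""
    return _unit_vector("enc:" + window)


def action(edit: EditSpec) -> list[float]:
    """Stand-in for a = action(EditSpec)."""
    return _unit_vector(f"act:{edit.pos}:{edit.ref}>{edit.alt}")


def g(s_t: list[float], a: list[float], step: float = 0.1) -> list[float]:
    """Toy predictor head (assumed residual form): s_hat = s_t + step * a."""
    return [s + step * x for s, x in zip(s_t, a)]


w_ref = "ACGTACGTAC"
edit = EditSpec(pos=3, ref="T", alt="G")
s_t = enc(w_ref)                 # computed once per reference window
s_hat = g(s_t, action(edit))     # predicted post-edit state, same latent space
shift = math.dist(s_hat, s_t)    # one possible per-variant signal: latent displacement
```

Because `enc` never changes, `s_t` for a given reference window can be computed once and reused for every candidate edit in that window; only `g` runs per edit.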

Goals (v1)

The v0.1 paper/demo release is published with a public checkpoint, a dataset snapshot, a measured chr21 ClinVar evaluation, a clean-machine real-inference demo, and final publication evidence. The goals below are roadmap targets unless explicitly tied to a measured release artifact; v0.1 establishes a narrow first baseline and does not demonstrate broad model quality.

Per-edit latent prediction at < 10% of Carbon's per-variant inference cost after caching reference embeddings.
Variant-effect prediction matching or exceeding Carbon-500M zero-shot AUROC on ClinVar coding and non-coding benchmarks.
Multi-edit haplotype rollout in latent space with cosine similarity ≥ 0.85 against held-out single-edit windows and ≥ 0.80 against held-out three-edit haplotypes (Phase 2).
Latent planning via cross-entropy search over discrete edits, returning ordered edit lists for user-specified target states (Phase 2).
Surprise-based pathogenicity scoring via per-context calibrated predictor residuals — no supervised classifier training.
On-device deployment target on Apple Silicon (M3 Max baseline) with measured release gates of single-variant scoring < 200 ms and full-VCF scoring of 100k variants < 30 minutes (Phase 3).
Artifact provenance hooks in release and demo paths — content-addressed model identifiers, input commitments, and output receipts.
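The rollout and surprise-scoring goals can be illustrated with a minimal sketch. It assumes that rollout means applying the predictor once per edit and feeding each prediction back in, that the cosine gate compares the rolled-out latent with the observed latent of the edited window, and that "per-context calibrated" can be approximated by a z-score against residuals from the same context class. The function names, the toy predictor, and the numbers are hypothetical and are not measured results.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rollout(s0, actions, g):
    """Apply the predictor once per edit, feeding each prediction back in."""
    s = s0
    for a in actions:
        s = g(s, a)
    return s


def residual(predicted, observed):
    """Euclidean norm of the prediction error in latent space."""
    return math.dist(predicted, observed)


def surprise(res, context_residuals):
    """z-score of one residual against residuals from the same context class
    (an assumed calibration scheme, chosen for illustration)."""
    mu = statistics.mean(context_residuals)
    sigma = statistics.stdev(context_residuals)
    return (res - mu) / sigma


def toy_g(s, a, step=0.5):
    """Toy predictor: additive update (assumption)."""
    return [x + step * y for x, y in zip(s, a)]


s0 = [1.0, 0.0, 0.0]
three_edits = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
s3_hat = rollout(s0, three_edits, toy_g)   # rolled-out latent after three edits
s3_obs = [1.0, 1.0, 0.4]                   # made-up "observed" latent of the edited window

sim = cosine(s3_hat, s3_obs)
passes_three_edit_gate = sim >= 0.80       # threshold from the Phase 2 goal

res = residual(s3_hat, s3_obs)
z = surprise(res, [0.02, 0.04, 0.06])      # made-up calibration residuals
```

No classifier is trained anywhere in this flow; the score is purely a function of how far the observation falls from the prediction relative to what is typical for that context.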

Non-goals (v1)

Pretraining a new DNA foundation model. Carbon is the encoder.
Decoding latents back to DNA bases. Carbon does this; GenoLeWM does not.
Clinical decision support. Output is a research signal.
Protein-structure prediction. Out of scope; nucleotide-only latent.
Multi-omics fusion (RNA-seq, ATAC-seq, methylation). Deferred to v2.
Hosted inference service. Local-first; community may build hosted variants from the open weights.
Germline-edit reproductive use. Excluded by safety frame; see §06-security.

Success criteria

A release is shippable when, jointly:

The reference checkpoint clears all eval gates in docs/spec/07-testing-strategy.md and the per-track targets in docs/spec/08-performance-budget.md.
A user can install the runtime, score a VCF on a laptop, and produce a checksum receipt that verifies against the released model manifest.
Every RFC in docs/rfcs/ is at status Accepted or Superseded.
Every public surface enumerated in 02-public-api.md is versioned per the policy in 09-release-and-versioning.md.
Every error in 04-error-model.md is raised with the documented exception type and observable per 05-observability.md.
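The receipt criterion can be sketched with standard-library hashing. This is a minimal illustration under stated assumptions: the receipt fields (`model_id`, `input_commitment`, `output_digest`), the manifest layout (`model_ids`), and the choice of SHA-256 are hypothetical; the actual formats are defined elsewhere in the spec.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def make_receipt(model_bytes: bytes, vcf_bytes: bytes, scores: dict) -> dict:
    """Build a receipt binding model, input, and output by content hash."""
    # Canonical JSON (sorted keys) so the same scores always hash identically.
    output_bytes = json.dumps(scores, sort_keys=True).encode()
    return {
        "model_id": "sha256:" + sha256_hex(model_bytes),
        "input_commitment": "sha256:" + sha256_hex(vcf_bytes),
        "output_digest": "sha256:" + sha256_hex(output_bytes),
    }


def verify_receipt(receipt: dict, manifest: dict) -> bool:
    """Check that the receipt names a model listed in the released manifest."""
    return receipt["model_id"] in manifest["model_ids"]


# Toy byte strings stand in for a checkpoint file and a VCF file.
model_bytes = b"toy-weights"
receipt = make_receipt(model_bytes, b"##fileformat=VCFv4.2\n", {"var1": 0.5})
manifest = {"model_ids": ["sha256:" + sha256_hex(model_bytes)]}

ok = verify_receipt(receipt, manifest)
tampered = make_receipt(b"other-weights", b"##fileformat=VCFv4.2\n", {"var1": 0.5})
rejected = not verify_receipt(tampered, manifest)
```

Committing to the input by hash rather than embedding it keeps the receipt shareable without disclosing the genomic data itself.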

Audience

| Audience | Primary use |
| --- | --- |
| DNA-foundation-model researchers | predictor head on Carbon (or analogous encoder) |
| Bioinformatics tool builders | local-first variant scoring with reproducible receipts |
| Personal-genomics enthusiasts | desktop app for personal variant exploration |
| Reproducible-ML engineers | dataset, model, evaluation, and demo artifact provenance |
| Clinical-genomics researchers | fast first-pass screening tool (research only) |

The system is explicitly not for clinical decision-making, embryo selection, or any human reproductive use.

How this corpus is organized

| File | Role |
| --- | --- |
| `SPEC.md` | top-level index and executive summary |
| `SPECIFICATION.md` | synthesized canonical view (legacy entry point) |
| `docs/spec/00-overview.md` | thesis, goals, non-goals, success criteria (this file) |
| `docs/spec/01-architecture.md` | system architecture, module boundaries, data flow |
| `docs/spec/02-public-api.md` | public Python / CLI / runtime surface |
| `docs/spec/03-data-model.md` | types, schemas, on-disk formats, invariants |
| `docs/spec/04-error-model.md` | exception hierarchy, failure modes, recovery |
| `docs/spec/05-observability.md` | logging, metrics, tracing, redaction |
| `docs/spec/06-security.md` | threat model, trust boundaries, secrets handling |
| `docs/spec/07-testing-strategy.md` | test pyramid, ML-specific tests, CI gates |
| `docs/spec/08-performance-budget.md` | latency/throughput/memory targets, profiling |
| `docs/spec/09-release-and-versioning.md` | semver policy, deprecation, changelog discipline |
| `docs/spec/10-glossary.md` | canonical terms (also see `docs/glossary.md`) |
| `docs/rfcs/` | per-decision RFCs (also at `rfcs/`; the root copy is canonical) |
| `docs/roadmap/IMPLEMENTATION.md` | issue tracker dashboard |

When this document and an RFC disagree, the RFC wins. When two RFCs disagree, the higher-numbered one wins if it explicitly supersedes the other; otherwise, file a reconciliation PR.

Out-of-scope reminders

No fabricated benchmarks. Numbers ship only after the full eval suite runs.
Existing v0.1 numbers are first-release measured results and negative findings; v0.2 claims need their own measured evidence.
No TBDs ship: every uncertainty is either decided or promoted to an OPEN QUESTION with an owner.
No clinical claims. Every UI surface and every CLI banner carries the research-tool disclaimer.

Open questions tied to scope

| ID | Question | Owner | Target |
| --- | --- | --- | --- |
| OQ-OVR-1 | Whether to commit to non-Carbon encoders in v1 or accept them only via community PR | core | end of Phase 1 |
| OQ-OVR-2 | Whether to add a license addendum forbidding clinical use or rely on a README-only disclaimer | core | before v0.1.0 tag |
| OQ-OVR-3 | Whether to reassess the hosted-API stance, and only if post-launch demand warrants it | core | end of Phase 3 |