Specification
- Version: 0.1.0.dev0
- Status: Alpha implementation. Implemented so far: the reference infrastructure layer (errors, observability, provenance primitives, action specs, verify CLI), the optional-runtime base predictor and loss modules, the edit-balanced training sampler, collapse diagnostics, the Carbon corpus window sampler, and the planning cost/sampler primitives. Upcoming: trainer, autoregressive rollout, CEM, eval, and deploy.
This file is the entry point into the GenoLeWM specification corpus. Detailed content lives in two trees:
- This section (docs/spec/): eleven canonical sections, normative for v0.1.
- The per-decision RFC corpus.
Executive summary
GenoLeWM is an action-conditioned Joint-Embedding Predictive Architecture (JEPA) over DNA. A pretrained DNA foundation model (Carbon-500M by default, frozen) supplies a state vector for a contiguous genomic window. A small trainable predictor head, conditioned on a structured genomic edit, predicts the post-edit state in the same latent space.
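In symbols, the paragraph above can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the spec sections: $x$ is the genomic window, $a$ the structured edit, $E$ the frozen state encoder, $g_\phi$ the action encoder, $f_\theta$ the trainable predictor head, and $\mathrm{apply}(x, a)$ the edited window. Treating the encoder's embedding of the edited window as the prediction target is likewise an assumption; RFC-0004 and RFC-0005 are the normative source.

$$
z = E(x), \qquad \hat{z}' = f_\theta\bigl(z,\; g_\phi(a)\bigr), \qquad \hat{z}' \approx E\bigl(\mathrm{apply}(x, a)\bigr)
$$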
That single prediction step unlocks: variant-effect prediction at a fraction of Carbon's cost, multi-edit haplotype rollout, planning via CEM in latent space, surprise-based pathogenicity scoring, on-device deployment on consumer hardware, and checksum-based artifact provenance for releases.
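As a rough illustration of the last item, checksum-based artifact provenance, the sketch below hashes a release artifact and records the digest in a small JSON receipt. The function names, the receipt fields, and the file names are hypothetical; the actual receipt format is whatever RFC-0011 defines.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_receipt(artifact: Path, receipt: Path) -> dict:
    """Write a JSON receipt (hypothetical schema) describing the artifact."""
    record = {
        "artifact": artifact.name,
        "sha256": sha256_file(artifact),
        "size_bytes": artifact.stat().st_size,
    }
    receipt.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


def verify_receipt(artifact: Path, receipt: Path) -> bool:
    """Recompute the digest and compare it with the one stored in the receipt."""
    record = json.loads(receipt.read_text())
    return record["sha256"] == sha256_file(artifact)
```

Reading in chunks keeps memory use flat for large weight files, and any change to the artifact after the receipt is written makes `verify_receipt` return `False`.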
Spec corpus
| # | File | Subject |
|---|---|---|
| 00 | overview | thesis, goals, non-goals, success criteria |
| 01 | architecture | module boundaries, runtime flows, invariants |
| 02 | public-api | Python and CLI surface, stability classes |
| 03 | data-model | types, schemas, on-disk formats |
| 04 | error-model | exception hierarchy, failure modes, exit codes |
| 05 | observability | logging, metrics, tracing, redaction |
| 06 | security | threat model, trust boundaries, secrets |
| 07 | testing-strategy | test pyramid, ML-specific tests, CI gates |
| 08 | performance-budget | latency / throughput / memory targets |
| 09 | release-and-versioning | semver, deprecation, changelog discipline |
| 10 | glossary | canonical terms |
RFC corpus
Nineteen RFCs cover the load-bearing decisions:
- RFC-0001 — scope.
- RFC-0002 — state encoder (Carbon).
- RFC-0003 — action encoder (genomic edits).
- RFC-0004 — predictor (cross-attention Transformer).
- RFC-0005 — training objective (cosine + MSE; LeJEPA in Phase 2).
- RFC-0006 — data pipeline (corpus, edit mix, holdouts).
- RFC-0007 — evaluation suite (VEP, rollout, efficiency).
- RFC-0008 — planning (CEM).
- RFC-0009 — surprise scoring (calibrated per context).
- RFC-0010 — deployment (Apple Silicon, int4 / int8).
- RFC-0011 — artifact provenance and checksum receipts.
- RFC-0012 — error taxonomy.
- RFC-0013 — observability and redaction.
- RFC-0014 — API stability policy.
- RFC-0015 — testing and CI gates.
- RFC-0016 — performance budget.
- RFC-0017 — configuration system.
- RFC-0018 — CLI design.
- RFC-0019 — reference desktop app skeleton.
The full index lives in the repository at rfcs/README.md.
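RFC-0005 in the list above names the training objective as cosine + MSE. A minimal sketch of such a combined loss for one predicted/target vector pair is shown below, in plain Python so it runs without a tensor library. The function name, the two weights, and the mean reduction are assumptions for illustration; the RFC is the authority on the real formulation.

```python
import math
from typing import Sequence


def cosine_mse_loss(
    pred: Sequence[float],
    target: Sequence[float],
    cosine_weight: float = 1.0,  # hypothetical weighting, not taken from the spec
    mse_weight: float = 1.0,     # hypothetical weighting, not taken from the spec
    eps: float = 1e-8,
) -> float:
    """Return cosine_weight * (1 - cosine similarity) + mse_weight * MSE."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")

    # Cosine term: penalizes a difference in direction between the two vectors.
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    cosine = dot / max(norm_pred * norm_target, eps)

    # MSE term: penalizes a difference in magnitude, averaged over dimensions.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

    return cosine_weight * (1.0 - cosine) + mse_weight * mse
```

The two terms are complementary: the cosine term ignores scale, while the MSE term is sensitive to it, so a prediction has to match the target in both direction and magnitude to drive the loss to zero.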
Conflict resolution
If this index and a section disagree, the section wins.
If a section and an RFC disagree, the RFC wins.
If two RFCs disagree without an explicit Supersedes relationship, file
a reconciliation PR.
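The precedence rules above can be expressed as a small lookup. This is only an illustrative sketch: the `Source` enum and `resolve` function are hypothetical names, and the index-versus-RFC case is inferred by transitivity rather than stated in the text.

```python
from enum import IntEnum


class Source(IntEnum):
    """Kinds of spec document, ordered from lowest to highest precedence."""
    INDEX = 0
    SECTION = 1
    RFC = 2


def resolve(a: Source, b: Source, explicit_supersedes: bool = False) -> str:
    """Describe how a disagreement between two sources is settled."""
    if a is Source.RFC and b is Source.RFC:
        # Two RFCs: only an explicit Supersedes relationship settles it.
        if explicit_supersedes:
            return "follow the Supersedes relationship"
        return "file a reconciliation PR"
    if a == b:
        # The text gives no rule for index-vs-index or section-vs-section.
        raise ValueError("no rule is defined for this pair")
    # Otherwise the higher-precedence kind wins (index < section < RFC).
    return f"{max(a, b).name} wins"
```

For example, `resolve(Source.INDEX, Source.SECTION)` returns `"SECTION wins"`, and `resolve(Source.RFC, Source.RFC)` returns `"file a reconciliation PR"`.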
What is not specified here
- Project history, contributor list, license text: LICENSE, the README, CONTRIBUTING.
- Implementation roadmap and phase exit criteria: ROADMAP.md.
- Operational security disclosure: SECURITY.
- Open user-data privacy posture: PRIVACY.
- Implementation tracker: docs/roadmap/IMPLEMENTATION.md.