GenoLeWM

Action-conditioned JEPA world model for DNA, built on top of Carbon.

GenoLeWM treats genetic edits as first-class actions. A frozen DNA foundation model (Carbon-500M by default) supplies a state vector for a genomic window; a small trainable predictor head, conditioned on a structured edit, predicts the post-edit state in the same latent space:

$$ \hat s_{t+1} = g(s_t, a) \qquad s_t = \mathrm{enc}(w_{\text{ref}}) \qquad a = \mathrm{action}(\text{EditSpec}) $$
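To make the three pieces of the equation concrete, here is a minimal NumPy sketch of one prediction step. Everything in it is an illustrative assumption rather than the project's API: `enc` is a toy stand-in for the frozen Carbon encoder, `action` is a hypothetical embedding of a single-base substitution, and `ToyPredictor` is an untrained residual head. Only the shape of the computation, $\hat s_{t+1} = g(s_t, a)$, follows the text above.

```python
import numpy as np

DIM = 8  # toy latent size; the real Carbon state dimension is not assumed here


def enc(window: str) -> np.ndarray:
    """Toy stand-in for the frozen encoder: base-composition features padded to DIM."""
    counts = np.array([window.count(b) for b in "ACGT"], dtype=np.float64)
    state = np.zeros(DIM)
    state[:4] = counts / max(len(window), 1)
    return state


def action(pos: int, ref: str, alt: str, window_len: int) -> np.ndarray:
    """Hypothetical edit embedding: relative position plus one-hot ref and alt base."""
    vec = np.zeros(9)
    vec[0] = pos / window_len
    vec[1 + "ACGT".index(ref)] = 1.0
    vec[5 + "ACGT".index(alt)] = 1.0
    return vec


class ToyPredictor:
    """Small residual head g(s, a); weights are random, i.e. untrained."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(state_dim + action_dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, state_dim))

    def __call__(self, s: np.ndarray, a: np.ndarray) -> np.ndarray:
        h = np.tanh(np.concatenate([s, a]) @ self.w1)
        return s + h @ self.w2  # predict a residual in the same latent space


w_ref = "ACGTACGTAC"
s_t = enc(w_ref)                               # s_t = enc(w_ref)
a = action(pos=2, ref="G", alt="T", window_len=len(w_ref))
g = ToyPredictor(state_dim=DIM, action_dim=a.size)
s_hat = g(s_t, a)                              # predicted post-edit state
```

Note that the encoder is called once on the reference window; the edit enters only through the action vector, which is what would allow fewer encoder passes per variant once the head is trained.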

The research hypothesis is that this can support:

  • Variant-effect prediction with fewer Carbon passes once trained.
  • Multi-edit haplotype rollout in latent space.
  • Planning over edit sequences via latent MPC.
  • Surprise-based pathogenicity scoring, using predictor error as the signal.
  • Local-first inference over user-provided variant files using released or locally staged model artifacts.
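Two of the hypotheses above, multi-edit rollout and surprise scoring, reduce to short loops over the predictor. The sketch below illustrates both under stated assumptions: `g` is a toy additive predictor used only so the numbers are easy to follow, and surprise is taken to be the squared L2 distance between the predicted and the observed post-edit state. The source says only that predictor error is the signal, so the exact metric here is an assumption.

```python
import numpy as np


def rollout(g, s0: np.ndarray, actions: list) -> list:
    """Apply edits one after another in latent space, without re-encoding."""
    states = [s0]
    for a in actions:
        states.append(g(states[-1], a))
    return states


def surprise(s_pred: np.ndarray, s_obs: np.ndarray) -> float:
    """Predictor error as a score (assumed here: squared L2 distance)."""
    return float(np.sum((s_pred - s_obs) ** 2))


def g(s: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Toy predictor for illustration only: the action shifts the state."""
    return s + a


s0 = np.zeros(3)
actions = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]
states = rollout(g, s0, actions)      # [s0, s1, s2]

# Pretend the encoder observed this state for the fully edited window.
s_observed = np.array([1.0, 2.0, 1.0])
score = surprise(states[-1], s_observed)
```

In this toy case the rollout ends at `[1, 2, 0]`, so the only disagreement with the observed state is in the last coordinate and the score is 1.0. A larger score would mean the edit moved the encoder state somewhere the predictor did not expect.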

Where to start

| If you have… | Read |
| --- | --- |
| 5 minutes | this page → Quickstart |
| 30 minutes | Specification index, Architecture |
| an afternoon | the full RFC corpus |
| a contribution to land | Contributing and the implementation tracker |

What ships today

The repository currently ships alpha code plus the public geno-lewm-v0.1.0-r1 paper/demo artifact set. Until the first PyPI release is tagged, install the Python package from source by following the Quickstart.

  • Core Python surface: typed errors, privacy-aware structured logs, metrics, canonical edit specs, pure-Python edit application, ActionEncoder, Predictor, ARPredictor, surprise scoring, and local-only personal-genome importers.
  • Data and training contracts: Carbon window sampling, gnomAD and ClinVar VCF-to-Parquet prep commands, tuple-builder source-mix and holdout rules, GenoLeWMDataset, fixture smoke training, Carbon preflight, and a preflight-gated Carbon-backed trainer launcher.
  • Evaluation and release contracts: checksum manifests/receipts, geno-lewm-score, geno-lewm-verify, Carbon zero-shot baseline scoring, measured metrics aggregation, efficiency-report generation, terminal-demo transcript generation, dataset/model/paper package verifiers, Hub dry-run/publish helpers, clean-machine replay, and final publication-evidence binding.
  • Project guardrails: public API snapshot tests, duplicate-free __all__ checks, source-language linting for de-scoped trust claims, release-blocker issue references, and strict docs rendering.
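As an illustration of what "pure-Python edit application" in the list above can look like, here is a minimal sketch that applies a VCF-style ref/alt edit to a window string. `ToyEdit` and `apply_edit` are hypothetical names invented for this example; the project's canonical edit spec may carry different fields and validation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEdit:
    """Hypothetical edit record: 0-based position, reference allele, alternate allele."""
    pos: int
    ref: str
    alt: str


def apply_edit(window: str, edit: ToyEdit) -> str:
    """Return the edited window, refusing edits whose ref allele does not match."""
    end = edit.pos + len(edit.ref)
    if edit.pos < 0 or end > len(window):
        raise ValueError("edit falls outside the window")
    if window[edit.pos:end] != edit.ref:
        raise ValueError("reference allele does not match the window")
    return window[:edit.pos] + edit.alt + window[end:]


window = "ACGTAC"
snv = apply_edit(window, ToyEdit(pos=2, ref="G", alt="T"))        # substitution
deletion = apply_edit(window, ToyEdit(pos=1, ref="CG", alt="C"))  # drops the G
insertion = apply_edit(window, ToyEdit(pos=3, ref="T", alt="TGG"))
```

Checking the reference allele before editing is the design choice worth keeping in any real version: it catches coordinate or strand mismatches early instead of silently producing a wrong post-edit window.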

The v0.1 publication evidence is public:

What is not established yet: broad model quality beyond the narrow chr21 ClinVar v0.1 slice, RFC-0004 rollout-speed closure, useful multi-edit planning behavior, clinical utility, privacy assurance, or runtime assurance beyond checksum provenance. See the roadmap and v0.2 epic #197.

Acknowledgments

  • Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero for LeWorldModel.
  • The Hugging Face Bio team, Zhongguancun Academy, TIGEM / Federico II for Carbon.
  • The CodeLeWM project for the recipe of porting LeWM to a structured symbolic domain.