
Glossary

Cross-cutting terminology used across the GenoLeWM specification and RFCs. Maintained in sync with the SPECIFICATION and the RFCs; any term that appears in more than one RFC should appear here.


ML / architecture

Action. A structured genetic edit (SNV, indel, MNV, or structural variant) passed to the predictor as a first-class input. The canonical type is EditSpec (RFC-0003); the full set of edit types is listed under EditType.

Action embedding. The fixed-size vector representation of an action, produced by the action encoder. Dimension d_action = 512 in v1.

Action encoder. The trainable module that maps an EditSpec (or RelEdit) to an action embedding. See RFC-0003.

Autoregressive predictor (ARPredictor). The wrapper around the predictor that unrolls it step-by-step over a sequence of actions, with KV caching, for multi-edit haplotype rollout and planning. See RFC-0004 §3.3.
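
The step-by-step unrolling can be pictured as a loop that feeds each predicted state back in as the input for the next action. The sketch below is only an illustration: rollout and toy_step are hypothetical names, the toy predictor is a stand-in for the real module, and KV caching is omitted entirely.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rollout(
    predict_step: Callable[[Vector, Vector], Vector],
    state: Vector,
    actions: Sequence[Vector],
) -> List[Vector]:
    """Unroll a one-step predictor over a sequence of action embeddings.

    Returns the full trajectory, starting with the initial state.
    """
    trajectory = [list(state)]
    for action in actions:
        # Each predicted latent becomes the input state for the next edit.
        state = predict_step(state, action)
        trajectory.append(list(state))
    return trajectory


def toy_step(state: Vector, action: Vector) -> Vector:
    # Hypothetical stand-in for the predictor: adds the action to the state.
    return [s + a for s, a in zip(state, action)]


states = rollout(toy_step, [0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]])
print(states[-1])
```

With two actions the trajectory holds three states: the initial one plus one per edit.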

Carbon. The DNA foundation model family released by HuggingFaceBio (500M / 3B / 8B). Used as GenoLeWM's state encoder. See RFC-0002.

Cross-attention predictor. The default predictor architecture: 4 cross-attention blocks (state ↔ action) plus 2 self-attention blocks on the fused sequence. See RFC-0004 §3.1.

JEPA. Joint-Embedding Predictive Architecture. A class of models that predict in representation space rather than input space.

LeJEPA regularizer. An isotropic-Gaussian regularizer on encoder output distributions that prevents collapse. Live in Phase 2 only. See RFC-0005 §3.2.

LeWorldModel (LeWM). The reference architecture and training recipe (Maes et al., 2026) for stable end-to-end JEPA world models. The basis for GenoLeWM's training recipe.

Predictor. The trainable module that maps (state, action) to predicted next-state in latent space. See RFC-0004.

State. A vector embedding of a contiguous DNA window produced by the state encoder (Carbon). Dimension d_state = 1024 for Carbon-500M.

State encoder. The module that maps a DNA window to a state embedding. In GenoLeWM this is Carbon (frozen by default in v1). See RFC-0002.

Surprise. The predictor's residual: the distance between predicted post-edit latent and the encoder's actual post-edit latent. Calibrated to a per-context percentile to produce a pathogenicity score. See RFC-0009.
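
As a rough illustration of the residual, the sketch below computes a Euclidean distance between two latents. The choice of Euclidean distance and the name surprise_raw are assumptions made for the example; the metric actually used is defined in RFC-0009.

```python
import math
from typing import Sequence


def surprise_raw(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Distance between the predicted and the encoded post-edit latent.

    Assumption: Euclidean (L2) distance; the real metric may differ.
    """
    if len(predicted) != len(actual):
        raise ValueError("latents must have the same dimension")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))


print(surprise_raw([0.0, 0.0], [3.0, 4.0]))
```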

Genomics

bp. Base pairs. A unit of DNA length.

ClinVar. NCBI's public database of human variants with clinical significance labels.

Coding / non-coding. Regions of DNA that do (coding) or do not (non-coding) encode protein sequence.

Edit / variant. A change to a reference DNA sequence. Used interchangeably; "edit" is the action-conditioned framing, "variant" is the genomics framing.

EditSpec. The canonical edit type: (chrom, pos, ref, alt, edit_type). See RFC-0003.

EditType. Enumeration: {SNV, INS, DEL, MNV, INDEL, SV}.
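
A minimal sketch of how EditSpec and EditType might look as Python types. The field names and enum members come from the entries above; the field types, the string enum values, and the example coordinates are assumptions, and RFC-0003 remains the authoritative definition.

```python
from dataclasses import dataclass
from enum import Enum


class EditType(Enum):
    SNV = "SNV"
    INS = "INS"
    DEL = "DEL"
    MNV = "MNV"
    INDEL = "INDEL"
    SV = "SV"


@dataclass(frozen=True)
class EditSpec:
    chrom: str           # assumed to be a contig name such as "chr21"
    pos: int             # coordinate convention (0- or 1-based) is set by RFC-0003
    ref: str             # reference allele
    alt: str             # alternate allele
    edit_type: EditType


# Hypothetical example values, for illustration only.
edit = EditSpec(chrom="chr21", pos=1_000_100, ref="A", alt="G", edit_type=EditType.SNV)
print(edit)
```

Making the dataclass frozen keeps an edit hashable, which is convenient if edits are used as dictionary keys or set members.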

gnomAD. Genome Aggregation Database. The reference catalog of human genetic variation, used both for biological-prior edit sampling during training and for the surprise calibration distribution.

Haplotype. A coordinated set of variants on the same chromosome copy. In GenoLeWM, a multi-edit input to the predictor.

Indel. Insertion or deletion of one or more bases.

MNV. Multi-nucleotide variant: a substitution of equal-length ref and alt, both > 1 bp.

Pathogenic / Likely pathogenic (P/LP). ClinVar classes for variants asserted to cause disease (P) or strongly suspected to (LP).

RelEdit. Window-relative form of an EditSpec, with rel_pos as the offset within the window in bp. See RFC-0003 §3.3.
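
The window-relative offset is a subtraction plus a bounds check, as the sketch below shows. It assumes pos and window_start share one coordinate system, and that a RelEdit carries ref and alt alongside rel_pos; only rel_pos is named in this glossary, and to_rel_edit is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelEdit:
    rel_pos: int  # offset within the window, in bp
    ref: str      # assumed field
    alt: str      # assumed field


def to_rel_edit(pos: int, ref: str, alt: str,
                window_start: int, window_len: int = 12_288) -> RelEdit:
    """Convert an absolute position into a window-relative edit."""
    rel_pos = pos - window_start
    # The reference allele must lie entirely inside the window.
    if rel_pos < 0 or rel_pos + max(len(ref), 1) > window_len:
        raise ValueError("edit does not fit inside the window")
    return RelEdit(rel_pos=rel_pos, ref=ref, alt=alt)


rel = to_rel_edit(pos=1_000_100, ref="A", alt="G", window_start=1_000_000)
print(rel.rel_pos)
```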

SNV. Single-nucleotide variant: a substitution of one base for another.
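
The SNV, MNV, and indel definitions in this glossary can be told apart from allele lengths alone. The sketch below uses a hypothetical classify helper and deliberately stops at these three categories; how length-changing edits map onto the INS, DEL, and INDEL members of EditType is left to RFC-0003.

```python
def classify(ref: str, alt: str) -> str:
    """Classify an edit as "SNV", "MNV", or "indel" from its alleles."""
    if ref == alt:
        raise ValueError("ref and alt are identical; not an edit")
    if len(ref) == len(alt):
        # Equal-length substitution: one base is an SNV, more is an MNV.
        return "SNV" if len(ref) == 1 else "MNV"
    # Any length change is an insertion or a deletion.
    return "indel"


for ref, alt in [("A", "G"), ("AC", "GT"), ("A", "ACG"), ("ACG", "A")]:
    print(ref, alt, classify(ref, alt))
```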

SV. Structural variant: large-scale rearrangement (inversion, duplication, large indel, translocation). Not supported in v1; deferred to v2.

TraitGym. A curated benchmark of trait-associated variants for variant-effect prediction.

VCF. Variant Call Format. The standard file format for representing genetic variants.

Window. A contiguous DNA region passed to the state encoder. Default length 12,288 bp (2,048 Carbon 6-mer tokens).

Operations

Calibration table. The per-context empirical CDFs of σ_raw (the raw, uncalibrated surprise) over gnomAD common variants. The v0.1 model package ships calibration.parquet, but population-stratified calibration validity is not established by the first release. See RFC-0009 §3.4.
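
A per-context empirical CDF can be sketched as one sorted list of reference values per context, queried by binary search. This assumes σ_raw denotes the raw surprise value; the context labels, function names, and in-memory layout are hypothetical and do not reflect the schema of calibration.parquet.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_calibration(samples: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group (context, sigma_raw) samples into one sorted list per context."""
    table: Dict[str, List[float]] = defaultdict(list)
    for context, sigma_raw in samples:
        table[context].append(sigma_raw)
    return {context: sorted(values) for context, values in table.items()}


def percentile(table: Dict[str, List[float]], context: str, sigma_raw: float) -> float:
    """Fraction of reference values in this context that are <= sigma_raw."""
    reference = table[context]
    return bisect_right(reference, sigma_raw) / len(reference)


table = build_calibration([
    ("coding", 0.1), ("coding", 0.2), ("coding", 0.4), ("coding", 0.8),
    ("noncoding", 1.0),
])
print(percentile(table, "coding", 0.4))
```

Three of the four "coding" reference values are at or below 0.4, so the query returns 0.75.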

Cache (window cache). The on-disk Parquet store of pre-computed reference-window embeddings. Content-addressed by (window_hash, encoder_hash, state_layer, pool_type, pool_radius). See RFC-0002 §3.6.

Encoder hash. SHA-256 of the encoder weights file. Part of every cache key and checksum receipt.
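
The sketch below shows one way to compute an encoder hash by streaming a weights file through SHA-256, and then to fold it into a single cache key. Collapsing the key fields into one digest via sorted-key JSON is an assumption made for the example, as are the function names and the demo values; RFC-0002 §3.6 defines the actual key layout.

```python
import hashlib
import json
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(window_hash: str, encoder_hash: str, state_layer: int,
              pool_type: str, pool_radius: int) -> str:
    """Assumed scheme: hash a sorted-key JSON encoding of the five key fields."""
    fields = {
        "window_hash": window_hash,
        "encoder_hash": encoder_hash,
        "state_layer": state_layer,
        "pool_type": pool_type,
        "pool_radius": pool_radius,
    }
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Demo with a temporary file standing in for real encoder weights.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in encoder weights")
encoder_hash = file_sha256(tmp.name)
os.remove(tmp.name)

# All field values below are hypothetical.
key = cache_key(hashlib.sha256(b"ACGT").hexdigest(), encoder_hash,
                state_layer=-1, pool_type="mean", pool_radius=64)
print(key)
```

Because the encoder hash is part of the key, swapping the weights file changes every key, so stale embeddings are never read back for a different encoder.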

Holdout. A set of data excluded from training and reserved for evaluation. Three holdouts: holdout-chr (chr21), holdout-clinvar, holdout-haplotypes. See RFC-0006 §3.8.

Manifest. The canonical JSON document describing a GenoLeWM checkpoint's identity, configuration, and provenance. Its hash is the model_id. See RFC-0011 §3.7.

Receipt. The JSON document that release scoring paths can emit to bind model identity, input commitment, and output commitment. See RFC-0011.

Tuple builder. The data-pipeline component that produces (w_ref, a, w_alt) training tuples. See RFC-0006 §3.5.

Artifact provenance

Checksum receipt. The receipt mode currently supported by GenoLeWM. It binds the model manifest identity, input commitment, output commitment, and runtime metadata; it is not a model-quality or runtime-assurance guarantee.

Content addressing. Identifying data (weights, inputs, outputs) by the cryptographic hash of their canonical serialization, not by name or location.

Input commitment. SHA-256 hash of the canonical serialization of an inference's inputs.

model_id. The SHA-256 of a GenoLeWM checkpoint's manifest. Globally identifies a specific released checkpoint artifact set.
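
The terms in this section fit together as sketched below: every identifier is a SHA-256 over a canonical serialization, and a checksum receipt collects those identifiers. The canonical form used here (sorted-key, compact JSON) is an assumption, as are the helper names, the receipt field names, and the toy manifest; RFC-0011 defines the real serialization and receipt schema.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Assumed canonical form: sorted keys, no whitespace, UTF-8 encoded."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def content_hash(obj) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


# Hypothetical manifest, inputs, and outputs for illustration only.
manifest = {"encoder": "carbon-500m", "name": "example-checkpoint"}
inputs = {"chrom": "chr21", "pos": 1000100, "ref": "A", "alt": "G"}
outputs = {"percentile": "0.97"}

receipt = {
    "model_id": content_hash(manifest),
    "input_commitment": content_hash(inputs),
    "output_commitment": content_hash(outputs),
    "runtime": {"note": "runtime metadata would go here"},
}
print(json.dumps(receipt, indent=2))
```

Sorting keys makes the hash independent of dictionary insertion order. Floating-point values are a common source of non-canonical output, which is why the toy output stores its score as a string.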

Phases

Current public status: alpha implementation with a completed v0.1 paper/demo release. Phase names below describe the roadmap, not broader model-quality evidence.

Phase 0 (Design, complete). Spec and RFC bootstrap.

Phase 1 (MVP). Carbon-500M frozen, SNVs only, ClinVar coding/non-coding eval. See ROADMAP.

Phase 2 (Full edits). Adds indels, MNVs, LoRA-adapted encoder, LeJEPA regularizer, planning, calibrated surprise. Full eval suite.

Phase 3 (On-device). Export pipeline, quantization, desktop app skeleton.