RFC-0017: Configuration system
- Status: Draft
- Author(s): GenoLeWM Project
- Created: 2026-05-20
- Updated: 2026-06-02
- Depends on: RFC-0001, RFC-0012, RFC-0013, RFC-0018
- Supersedes: —
- Implementation status: Partial — closed dataclass/YAML config schema, defaults, strict unknown-key rejection, resolved-config writing, and first-experiment train/eval configs exist. Hydra-style sweeps, config diffing, and editor schema export remain open.
1. Summary
GenoLeWM uses a Hydra-based composable YAML configuration system with a
typed schema (Pydantic v2 or dataclass-based) and a single canonical
override syntax. This RFC specifies the configuration namespaces, the
override precedence, the schema-versioning policy, the secret-handling
rules, and the discovery mechanism (--print-config).
2. Motivation
A scientific ML project needs configs that are:
- Composable. Sweeping over encoder=carbon-500m vs encoder=carbon-3b should not require duplicating the whole training config.
- Reproducible. The resolved config is part of the run artifact and is hashed into the manifest.
- Discoverable. A new contributor finds every knob without grepping argparse calls.
- Typed. A typo (predictor.heads_n: 8 vs predictor.n_heads: 8) fails at load time, not at step 10,000.
Hydra + dataclass / Pydantic schema is the established pattern; we adopt it with a small set of project conventions.
3. Specification
3.1 Layout
geno_lewm/config/
├── defaults/
│   ├── train.yaml          # top-level training config
│   ├── score.yaml          # top-level scoring config
│   ├── eval.yaml           # top-level eval config
│   ├── plan.yaml           # top-level planning config
│   ├── encoder/
│   │   ├── carbon-500m.yaml
│   │   ├── carbon-3b.yaml
│   │   └── carbon-8b.yaml
│   ├── predictor/
│   │   ├── default-bf16.yaml
│   │   ├── small-bf16.yaml
│   │   └── tiny-int8.yaml
│   ├── data/
│   │   ├── default.yaml
│   │   ├── small.yaml
│   │   └── debug.yaml
│   ├── optimizer/
│   │   ├── adamw.yaml
│   │   └── sgd-momentum.yaml
│   └── observability/
│       ├── local.yaml
│       ├── wandb.yaml
│       └── full.yaml
└── schema.py               # Pydantic / dataclass schema
Top-level configs reference subconfigs via Hydra's defaults: block.
3.2 Schema
The schema is declared with pydantic.BaseModel (or dataclasses,
TBD by implementation; the field set is fixed regardless):
from typing import Literal

from pydantic import BaseModel

# EncoderConfig, PredictorConfig, and the other sub-schemas are defined alongside this class.
class GenoLeWMConfig(BaseModel):
    run_id: str
    seed: int
    phase: Literal["phase1", "phase2"]
    encoder: EncoderConfig
    predictor: PredictorConfig
    action: ActionEncoderConfig
    optimizer: OptimizerConfig
    data: DataConfig
    eval: EvalConfig
    observability: ObservabilityConfig
    runtime: RuntimeConfig
    deterministic: bool = False
    schema_version: str = "1.0.0"
Sub-schemas mirror the RFC subsystems. Every field has a default that matches the RFC's documented default; every field has a docstring.
3.3 Namespaces
Top-level keys. All are required in the resolved config, and no free-form top-level fields are allowed:
| Key | Owner |
|---|---|
| run_id | trainer; auto-generated if absent |
| seed | trainer |
| phase | trainer |
| encoder | RFC-0002 |
| predictor | RFC-0004 |
| action | RFC-0003 |
| optimizer | RFC-0005 |
| data | RFC-0006 |
| eval | RFC-0007 |
| observability | RFC-0013 |
| runtime | RFC-0010 |
| deterministic | RFC-0005 |
| schema_version | RFC-0017 |
Unknown top-level keys are an error: ConfigError.UnknownTopLevelKey.
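A minimal sketch of the closed top-level key set, using only the standard library. The key names come from the table above. The function name `reject_unknown_top_level_keys` and the way `UnknownTopLevelKey` is attached to `ConfigError` are assumptions for illustration, not the project's actual loader.

```python
# Hypothetical sketch: the key set is from section 3.3; names and layout are assumed.
ALLOWED_TOP_LEVEL_KEYS = frozenset({
    "run_id", "seed", "phase", "encoder", "predictor", "action", "optimizer",
    "data", "eval", "observability", "runtime", "deterministic", "schema_version",
})


class ConfigError(Exception):
    """Base class for configuration errors."""


class UnknownTopLevelKey(ConfigError):
    """Raised when a config contains a top-level key outside the closed set."""


# Expose the subclass under the dotted name used in the text.
ConfigError.UnknownTopLevelKey = UnknownTopLevelKey


def reject_unknown_top_level_keys(raw: dict) -> None:
    """Raise ConfigError.UnknownTopLevelKey if `raw` has keys outside the schema."""
    unknown = sorted(set(raw) - ALLOWED_TOP_LEVEL_KEYS)
    if unknown:
        raise ConfigError.UnknownTopLevelKey(
            f"unknown top-level key(s): {', '.join(unknown)}"
        )
```

Running the check on the raw mapping before any merging means a typo such as encoderr is reported under its own name instead of surfacing later as a missing encoder section.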
3.4 Override syntax
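Overrides use Hydra-style key=value syntax on the command line: for example, encoder=carbon-3b selects a subconfig and predictor.n_heads=8 sets a single field.
Multi-run sweeps use -m with comma-separated values, for example -m encoder=carbon-500m,carbon-3b.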
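To make the semantics concrete, here is a rough standard-library sketch of what a dotted key=value override and a comma-separated sweep mean. Hydra performs this parsing itself (including typed values and config-group selection). The helper names `parse_override`, `apply_override`, and `expand_sweep` are hypothetical, and values are kept as plain strings on the assumption that schema validation coerces them later.

```python
from itertools import product


def parse_override(item: str) -> tuple[list[str], str]:
    """Split 'a.b.c=value' into (['a', 'b', 'c'], 'value')."""
    key, sep, value = item.partition("=")
    if not sep or not key:
        raise ValueError(f"override must look like key=value, got {item!r}")
    return key.split("."), value


def apply_override(config: dict, item: str) -> None:
    """Set a dotted key in a nested dict in place; the value stays a string here."""
    path, value = parse_override(item)
    node = config
    for part in path[:-1]:
        node = node.setdefault(part, {})
    node[path[-1]] = value


def expand_sweep(items: list[str]) -> list[list[str]]:
    """Expand comma-separated values into one override list per combination."""
    choices = []
    for item in items:
        path, value = parse_override(item)
        key = ".".join(path)
        choices.append([f"{key}={v}" for v in value.split(",")])
    return [list(combo) for combo in product(*choices)]
```

Each list returned by `expand_sweep` corresponds to one run of the sweep; the combinations are the Cartesian product of the comma-separated choices.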
3.5 Resolution
- Load the top-level config (e.g., train.yaml).
- Merge in the defaults: listed subconfigs.
- Apply CLI overrides.
- Validate via Pydantic; fail on type mismatch / unknown field.
- Persist the resolved config to {run_id}/config.resolved.yaml and to the training.config_file field of manifest.json.
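The resolution steps above can be sketched as a layered deep merge. This is an illustration under assumptions, not the project's loader: `deep_merge` and `resolve` are hypothetical names, configs are plain dicts, and layers are applied literally in the listed order so that later layers win. In Hydra itself, the position of `_self_` in the defaults list controls whether the top-level file or its subconfigs take precedence. Persistence is omitted because it needs a YAML writer.

```python
import copy


def deep_merge(base: dict, update: dict) -> dict:
    """Return a new dict where `update` wins over `base`, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(top_level: dict, subconfigs: list[dict], overrides: dict, validate) -> dict:
    """Layer top-level, then subconfigs, then overrides; validate the result."""
    resolved = copy.deepcopy(top_level)
    for sub in subconfigs:
        resolved = deep_merge(resolved, sub)
    resolved = deep_merge(resolved, overrides)
    validate(resolved)  # expected to raise on type mismatch or unknown field
    return resolved
```

Passing the validator in as a callable keeps the merge logic independent of whether the schema ends up as Pydantic models or dataclasses.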
3.6 Secrets
No secrets in YAML. Secrets enter via environment variables (HF_TOKEN, WANDB_API_KEY, etc.), which the YAML references by name.
The loader reads the named variable at runtime. The resolved-config artifact records the variable's name but not its value.
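A small sketch of the by-name lookup, assuming the config stores only the variable's name. The key `api_key_env`, the nesting under `wandb`, and the function `read_secret` are hypothetical; the RFC does not fix the exact YAML key.

```python
import os


def read_secret(env_var_name: str, environ=None) -> str:
    """Look up a secret by the env-var name stored in the config."""
    env = os.environ if environ is None else environ
    try:
        return env[env_var_name]
    except KeyError:
        raise RuntimeError(f"environment variable {env_var_name} is not set") from None


# Hypothetical config fragment: only the variable's name appears, never its value.
observability = {"wandb": {"api_key_env": "WANDB_API_KEY"}}
```

Because the secret is fetched at the point of use and never written back into the config mapping, serializing the resolved config cannot leak the value.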
3.7 Schema versioning
schema_version is a top-level field. Loaders accept any version with the
same MAJOR, warn on a lower MINOR, and raise on a MAJOR mismatch. The
schema itself is versioned per the policy in
docs/spec/09-release-and-versioning.md.
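One possible reading of this rule, sketched with the standard library. It assumes that "lower MINOR" means the config file's MINOR is below the loader's, and it uses `ValueError` as a stand-in for whatever error type the project raises. `check_schema_version` and `LOADER_SCHEMA_VERSION` are hypothetical names.

```python
import warnings

LOADER_SCHEMA_VERSION = "1.0.0"  # matches the schema_version default in section 3.2


def check_schema_version(config_version: str,
                         loader_version: str = LOADER_SCHEMA_VERSION) -> None:
    """Raise on MAJOR mismatch; warn when the config's MINOR is below the loader's."""
    cfg_major, cfg_minor = (int(p) for p in config_version.split(".")[:2])
    ldr_major, ldr_minor = (int(p) for p in loader_version.split(".")[:2])
    if cfg_major != ldr_major:
        raise ValueError(
            f"schema MAJOR mismatch: config {config_version}, loader {loader_version}"
        )
    if cfg_minor < ldr_minor:
        # Assumption: an older MINOR is still loadable but worth flagging.
        warnings.warn(
            f"config schema {config_version} is older than loader {loader_version}"
        )
```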
3.8 Discovery
geno-lewm-train --print-config # prints resolved YAML
geno-lewm-train --print-config-tree # prints with sources
geno-lewm-train --explain encoder.pool_radius # prints docstring
--explain reads the field docstrings from the schema and renders them
in CLI-friendly form.
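A sketch of how a dotted-path lookup like --explain could work against a dataclass-based schema, assuming field descriptions live in `field(metadata={"doc": ...})`. The classes, the default value, and the description strings below are placeholders; only the path encoder.pool_radius comes from the text. With Pydantic, field descriptions would more likely come from `Field(description=...)`.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class EncoderConfig:
    # Hypothetical default and description; the RFC only names the dotted path.
    pool_radius: int = field(
        default=1, metadata={"doc": "Placeholder description for pool_radius."}
    )


@dataclass
class RootConfig:
    encoder: EncoderConfig = field(
        default_factory=EncoderConfig, metadata={"doc": "Encoder settings."}
    )


def explain(schema: type, dotted_path: str) -> str:
    """Walk a dataclass schema along 'a.b.c' and return the last field's doc text."""
    current = schema
    doc = ""
    for part in dotted_path.split("."):
        if not is_dataclass(current):
            raise KeyError(f"{part!r} is not reachable in {dotted_path!r}")
        by_name = {f.name: f for f in fields(current)}
        if part not in by_name:
            raise KeyError(f"unknown field {part!r} in {dotted_path!r}")
        found = by_name[part]
        doc = found.metadata.get("doc", "")
        current = found.type  # a class object here, since annotations are not deferred
    return doc
```

Keeping the description next to the field definition means the CLI help and the schema cannot drift apart.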
3.9 No alternative config formats
We do not support TOML, JSON, or INI as primary config formats. YAML is the single source. JSON-Schema export of the Pydantic schema is documented as a downstream convenience.
4. Rationale and alternatives
4.1 Why Hydra over OmegaConf-bare or argparse?
Hydra adds the composition (defaults:), the sweeping (-m), and the
output-directory conventions. OmegaConf alone leaves us to invent these
ourselves. argparse alone doesn't compose at all.
4.2 Why Pydantic for the schema rather than just OmegaConf-structured?
Pydantic's error messages on type mismatch are markedly better than OmegaConf's, and Pydantic models double as runtime API objects (Receipt is already specced as a dataclass; aligning the config schema with the runtime schema is a win).
4.3 Why YAML over TOML?
TOML's lack of merge-friendly multi-document support is the deciding factor. The Hydra defaults pattern is YAML-native.
4.4 Why a single top-level phase field?
The phase1 / phase2 choice gates whether L_reg is active, whether
LoRA is enabled, and a few other settings. Centralizing it prevents the
"oh I forgot to switch one of the three flags" bug.
4.5 Why prohibit unknown top-level keys?
Hydra otherwise silently swallows typos in top-level keys (e.g.,
encoderr: carbon-3b). A strict schema catches this at load.
5. Unresolved questions
- Whether to use Pydantic v2 vs dataclasses + omegaconf-structured. The field set is identical; implementation choice deferred to the first config PR.
- Whether to support env-var interpolation in YAML values (e.g., cache_dir: ${oc.env:GENO_LEWM_CACHE}). Hydra supports it; we may enable selectively.
- Whether to publish a JSON Schema for the config from CI for editor autocomplete.
6. Future work
- VS Code / nvim schema integration.
- A geno-lewm config diff a.yaml b.yaml helper for explaining sweep deltas.
- A web-based config explorer for the desktop app's settings UI.
7. Changelog
- 2026-06-02 — Updated implementation status for the current closed schema loader, defaults, resolved-config writing, and first-experiment configs.
- 2026-05-20 — Initial draft.