geno_lewm.config.schema
Typed configuration schema (RFC-0017 §3.2).
Frozen dataclasses for every subsystem listed in RFC-0017 §3.3. Each
field has a default that matches the RFC's documented default; the
docstring explains the field's purpose so the --explain flag
(PR #29) can render it.
The top-level GenoLeWMConfig aggregates the subsystem schemas
and is the single object that every CLI command resolves to (RFC-0018
§3.2). The loader at geno_lewm.config.loader is responsible for
constructing it from YAML; the schema itself is pure data.
Phase 1 stays in lock-step with the RFC field names so the trainer (#44) can drop in without reshaping the config. Unimplemented subsystems (predictor, action encoder, planner) still carry default values here so the schema validates before those modules land.
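The frozen, all-defaults design described above can be illustrated with a minimal sketch. PredictorSketch and TopLevelSketch are hypothetical stand-ins that carry only a couple of fields; they are not the real schema classes, and the use of default_factory for the nested default is an assumption about style rather than something the schema is known to do.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class PredictorSketch:
    """Hypothetical stand-in for PredictorConfig (two fields only)."""

    n_layers: int = 6
    n_heads: int = 8


@dataclass(frozen=True)
class TopLevelSketch:
    """Hypothetical stand-in for GenoLeWMConfig."""

    run_id: str = "default"
    # Assumption: build the nested default lazily; the real schema may
    # use a plain instance default instead.
    predictor: PredictorSketch = field(default_factory=PredictorSketch)


cfg = TopLevelSketch()

# Every field has a default, so the object constructs with no arguments
# even though nothing consumes the predictor section yet.
assert cfg.predictor.n_layers == 6

# Frozen instances reject attribute assignment.
try:
    cfg.run_id = "other"
    mutated = True
except FrozenInstanceError:
    mutated = False

# Overrides therefore produce a new object and leave the original intact.
deeper = replace(cfg, predictor=replace(cfg.predictor, n_layers=12))
```

Because instances are immutable, a resolved config can be passed between subsystems without any of them changing it underneath the others.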
EncoderConfig
dataclass
EncoderConfig(model_id: str = 'HuggingFaceBio/Carbon-500M', revision: str = 'main@deadbeef', dtype: str = 'bf16', state_layer: int = 20, pool_type: str = 'centered_mean', pool_radius: int = 8, normalize: bool = True, trust_remote_code: bool = False)
State encoder configuration (RFC-0002 §3.1, §3.8).
The Phase 1 default is Carbon-500M with bf16 weights, pinned to a specific revision so the encoder hash committed to the manifest is reproducible.
PredictorConfig
dataclass
PredictorConfig(architecture: str = 'cross_attention', n_layers: int = 6, n_heads: int = 8, d_state: int = 512, d_action: int = 64, dtype: str = 'bf16')
Action-conditioned predictor (RFC-0004 §3.1).
ActionEncoderConfig
dataclass
ActionEncoderConfig(d_action: int = 64, max_len: int = 16, sub_encoders: tuple[str, ...] = ('snv', 'ins', 'del', 'mnv'))
Action encoder configuration (RFC-0003 §3.4).
TrainingConfig
dataclass
Launch controls for real, Carbon-backed training runs.
max_steps is the configured step horizon for geno-lewm-train
--carbon-train. Fixture smoke runs keep using the CLI --steps
control, so small release-plumbing tests cannot silently define the
first real training horizon.
OptimizerConfig
dataclass
OptimizerConfig(name: Literal['adamw', 'sgd-momentum'] = 'adamw', lr: float = 0.0003, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, grad_clip: float = 1.0, warmup_steps: int = 1000, schedule: Literal['wsd', 'cosine', 'constant'] = 'wsd')
Optimizer + learning-rate schedule (RFC-0005).
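To make the schedule field concrete, here is a minimal sketch of one common reading of 'wsd', namely warmup-stable-decay. That expansion is an assumption, since this page does not spell it out. The function name wsd_lr and the total_steps and decay_steps parameters are hypothetical and are not fields of OptimizerConfig; peak_lr and warmup_steps reuse the documented defaults for lr and warmup_steps.

```python
def wsd_lr(
    step: int,
    *,
    peak_lr: float = 3e-4,       # OptimizerConfig.lr default
    warmup_steps: int = 1000,    # OptimizerConfig.warmup_steps default
    total_steps: int = 10_000,   # hypothetical, not a schema field
    decay_steps: int = 2_000,    # hypothetical, not a schema field
) -> float:
    """Learning rate at `step` under a warmup-stable-decay shape."""
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable plateau at the peak learning rate.
        return peak_lr
    # Linear decay to 0 over the final decay_steps steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps
```

The decay here is linear for simplicity; RFC-0005 may specify a different decay shape or endpoint.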
DataConfig
dataclass
DataConfig(corpus_id: str = 'HuggingFaceBio/carbon-pretraining-corpus', corpus_revision: str = 'main@cafef00d', batch_size: int = 64, num_workers: int = 4, shuffle_buffer: int = 4096)
Data pipeline configuration (RFC-0006).
EvalConfig
dataclass
EvalConfig(benchmarks: tuple[str, ...] = ('clinvar_coding', 'clinvar_noncoding', 'rollout'), smoke_variants: int = 1000)
Evaluation harness (RFC-0007).
ObservabilityConfig
dataclass
ObservabilityConfig(log_level: Literal['debug', 'info', 'warn', 'error'] = 'info', redaction_strict: bool = True, wandb_project: str | None = None)
Observability sinks (RFC-0013).
RuntimeConfig
dataclass
RuntimeConfig(backend: Literal['onnx', 'coreml', 'gguf', 'torch'] = 'torch', device: Literal['cpu', 'cuda', 'mps'] = 'cpu')
Runtime / deployment target (RFC-0010).
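Dataclasses do not enforce Literal annotations at runtime, so a value such as backend='tpu' would construct without error unless something checks it. The sketch below shows one way such a check could be written with the standard typing helpers. RuntimeSketch and literal_violations are hypothetical names; whether the real loader validates Literal fields this way is not stated here.

```python
from dataclasses import dataclass, fields
from typing import Literal, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class RuntimeSketch:
    """Hypothetical stand-in mirroring the RuntimeConfig signature."""

    backend: Literal["onnx", "coreml", "gguf", "torch"] = "torch"
    device: Literal["cpu", "cuda", "mps"] = "cpu"


def literal_violations(cfg: object) -> list[str]:
    """Return one message per field whose value is outside its Literal set."""
    hints = get_type_hints(type(cfg))
    problems: list[str] = []
    for f in fields(cfg):
        hint = hints[f.name]
        # Only Literal[...] annotations carry an allowed-value set.
        if get_origin(hint) is Literal:
            value = getattr(cfg, f.name)
            if value not in get_args(hint):
                problems.append(f"{f.name}={value!r} not in {get_args(hint)}")
    return problems
```

The same helper would apply unchanged to other Literal-typed fields on this page, such as the optimizer name or the log level.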
GenoLeWMConfig
dataclass
GenoLeWMConfig(run_id: str = 'default', seed: int = 0, phase: Literal['phase1', 'phase2'] = 'phase1', encoder: EncoderConfig = EncoderConfig(), predictor: PredictorConfig = PredictorConfig(), action: ActionEncoderConfig = ActionEncoderConfig(), training: TrainingConfig = TrainingConfig(), optimizer: OptimizerConfig = OptimizerConfig(), data: DataConfig = DataConfig(), eval: EvalConfig = EvalConfig(), observability: ObservabilityConfig = ObservabilityConfig(), runtime: RuntimeConfig = RuntimeConfig(), deterministic: bool = False, schema_version: str = '1.0.0')
Top-level configuration object.
Every CLI command resolves to one of these. run_id is the
primary key for run artifacts (${run_id}/config.resolved.yaml,
${run_id}/checkpoints/*); the trainer auto-generates one if the
caller does not provide it.
The schema_version field tracks the on-disk shape of
config.resolved.yaml; version bumps follow RFC-0014's MAJOR/MINOR
rules for the config-resolution layer.
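A reader of config.resolved.yaml would typically compare the stored schema_version against the version it supports. The sketch below assumes a conventional rule: the MAJOR components must match, and the on-disk MINOR must not exceed the supported MINOR. RFC-0014's actual rules are not reproduced on this page, so treat this rule and both function names as hypothetical.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(on_disk: str, supported: str = "1.0.0") -> bool:
    """Assumed rule: same MAJOR, and on-disk MINOR <= supported MINOR."""
    disk = parse_version(on_disk)
    sup = parse_version(supported)
    return disk[0] == sup[0] and disk[1] <= sup[1]
```

Under this assumed rule, PATCH differences never affect compatibility.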
iter_top_level_field_names
Yield the canonical top-level keys accepted by the loader.
The loader rejects any payload key not in this set with
geno_lewm.errors.UnknownTopLevelKeyError. Tests use this
helper to assert that the schema and the AC list stay in sync.
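One plausible way to implement such a helper is to derive the key set from the dataclass itself, so the accepted keys cannot drift from the schema. The sketch below is an assumption about the approach, not the actual implementation: TopLevelSketch, iter_field_names_sketch, check_payload_keys, and UnknownKeySketchError are hypothetical stand-ins for the real class, helper, loader check, and error type.

```python
from dataclasses import dataclass, fields
from typing import Any, Iterator, Mapping


class UnknownKeySketchError(ValueError):
    """Hypothetical stand-in for UnknownTopLevelKeyError."""


@dataclass(frozen=True)
class TopLevelSketch:
    """Hypothetical stand-in for GenoLeWMConfig (three fields only)."""

    run_id: str = "default"
    seed: int = 0
    deterministic: bool = False


def iter_field_names_sketch() -> Iterator[str]:
    """Yield field names in declaration order."""
    for f in fields(TopLevelSketch):
        yield f.name


def check_payload_keys(payload: Mapping[str, Any]) -> None:
    """Raise if the payload carries a key the schema does not declare."""
    allowed = set(iter_field_names_sketch())
    unknown = sorted(set(payload) - allowed)
    if unknown:
        raise UnknownKeySketchError(
            f"unknown top-level key(s): {', '.join(unknown)}"
        )
```

Rejecting unknown keys up front turns a misspelled key such as "sede" into an immediate error instead of a silently ignored override.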