
geno_lewm.config.schema

schema

Typed configuration schema (RFC-0017 §3.2).

Frozen dataclasses for every subsystem listed in RFC-0017 §3.3. Each field has a default that matches the RFC's documented default; the docstring explains the field's purpose so the --explain flag (PR #29) can render it.

The top-level :class:`GenoLeWMConfig` aggregates the subsystem schemas and is the single object that every CLI command resolves to (RFC-0018 §3.2). The loader at :mod:`geno_lewm.config.loader` is responsible for constructing it from YAML; the schema itself is pure data.

Phase 1 stays in lock-step with the RFC field names so the trainer (#44) can drop in without reshaping the config. Unimplemented subsystems (predictor, action encoder, planner) still carry default values here so the schema validates before those modules land.

EncoderConfig dataclass

EncoderConfig(model_id: str = 'HuggingFaceBio/Carbon-500M', revision: str = 'main@deadbeef', dtype: str = 'bf16', state_layer: int = 20, pool_type: str = 'centered_mean', pool_radius: int = 8, normalize: bool = True, trust_remote_code: bool = False)

State encoder configuration (RFC-0002 §3.1, §3.8).

The Phase 1 default is Carbon-500M with bf16 weights, pinned to a specific revision so the encoder hash committed to the manifest is reproducible.
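As a rough illustration of what "frozen dataclass" means for a schema like this one, the sketch below uses a trimmed stand-in for EncoderConfig. Only the three field names and their defaults are taken from the signature above; the class name `EncoderConfigSketch` and its body are hypothetical and are not the package source.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class EncoderConfigSketch:
    """Illustrative stand-in; not the real geno_lewm class."""

    model_id: str = "HuggingFaceBio/Carbon-500M"
    revision: str = "main@deadbeef"
    dtype: str = "bf16"


base = EncoderConfigSketch()

# Frozen instances reject attribute assignment.
try:
    base.dtype = "fp32"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False

# An override builds a new object; the original stays untouched.
override = replace(base, dtype="fp32")
```

Because instances cannot be mutated after construction, an override always yields a new object, which is one way a pinned value such as the revision stays stable once the config has been resolved.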

PredictorConfig dataclass

PredictorConfig(architecture: str = 'cross_attention', n_layers: int = 6, n_heads: int = 8, d_state: int = 512, d_action: int = 64, dtype: str = 'bf16')

Action-conditioned predictor (RFC-0004 §3.1).

ActionEncoderConfig dataclass

ActionEncoderConfig(d_action: int = 64, max_len: int = 16, sub_encoders: tuple[str, ...] = ('snv', 'ins', 'del', 'mnv'))

Action encoder configuration (RFC-0003 §3.4).

TrainingConfig dataclass

TrainingConfig(max_steps: int = 50, collapse_log_every_steps: int = 500)

Launch controls for real Carbon-backed training.

max_steps is the configured step horizon for geno-lewm-train --carbon-train. Fixture smoke runs keep using the CLI --steps control, so small release-plumbing tests cannot silently define the first real training horizon.

OptimizerConfig dataclass

OptimizerConfig(name: Literal['adamw', 'sgd-momentum'] = 'adamw', lr: float = 0.0003, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, grad_clip: float = 1.0, warmup_steps: int = 1000, schedule: Literal['wsd', 'cosine', 'constant'] = 'wsd')

Optimizer + learning-rate schedule (RFC-0005).
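Dataclasses do not enforce `Literal` annotations at runtime, so a value outside the allowed set has to be caught by a separate check. The sketch below shows one way to do that with `typing.get_args`. `OptimizerConfigSketch` and `literal_violations` are hypothetical names for illustration; whether the real loader validates `Literal` fields, and how, is not stated here.

```python
from dataclasses import dataclass, fields
from typing import Literal, get_args, get_origin


@dataclass(frozen=True)
class OptimizerConfigSketch:
    """Trimmed stand-in using three fields from the signature above."""

    name: Literal["adamw", "sgd-momentum"] = "adamw"
    lr: float = 0.0003
    schedule: Literal["wsd", "cosine", "constant"] = "wsd"


def literal_violations(cfg) -> list[str]:
    """Return one message per field whose value is outside its Literal set.

    Assumes annotations are real objects (no `from __future__ import
    annotations`), so `field.type` can be inspected directly.
    """
    problems = []
    for f in fields(cfg):
        if get_origin(f.type) is Literal:
            value = getattr(cfg, f.name)
            allowed = get_args(f.type)
            if value not in allowed:
                problems.append(f"{f.name}={value!r} not in {allowed}")
    return problems


ok = literal_violations(OptimizerConfigSketch())
bad = literal_violations(OptimizerConfigSketch(schedule="linear"))
```

Here `ok` is an empty list, while `bad` contains a single message naming the `schedule` field.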

DataConfig dataclass

DataConfig(corpus_id: str = 'HuggingFaceBio/carbon-pretraining-corpus', corpus_revision: str = 'main@cafef00d', batch_size: int = 64, num_workers: int = 4, shuffle_buffer: int = 4096)

Data pipeline configuration (RFC-0006).

EvalConfig dataclass

EvalConfig(benchmarks: tuple[str, ...] = ('clinvar_coding', 'clinvar_noncoding', 'rollout'), smoke_variants: int = 1000)

Evaluation harness (RFC-0007).

ObservabilityConfig dataclass

ObservabilityConfig(log_level: Literal['debug', 'info', 'warn', 'error'] = 'info', redaction_strict: bool = True, wandb_project: str | None = None)

Observability sinks (RFC-0013).

RuntimeConfig dataclass

RuntimeConfig(backend: Literal['onnx', 'coreml', 'gguf', 'torch'] = 'torch', device: Literal['cpu', 'cuda', 'mps'] = 'cpu')

Runtime / deployment target (RFC-0010).

GenoLeWMConfig dataclass

GenoLeWMConfig(run_id: str = 'default', seed: int = 0, phase: Literal['phase1', 'phase2'] = 'phase1', encoder: EncoderConfig = EncoderConfig(), predictor: PredictorConfig = PredictorConfig(), action: ActionEncoderConfig = ActionEncoderConfig(), training: TrainingConfig = TrainingConfig(), optimizer: OptimizerConfig = OptimizerConfig(), data: DataConfig = DataConfig(), eval: EvalConfig = EvalConfig(), observability: ObservabilityConfig = ObservabilityConfig(), runtime: RuntimeConfig = RuntimeConfig(), deterministic: bool = False, schema_version: str = '1.0.0')

Top-level configuration object.

Every CLI command resolves to one of these. run_id is the primary key for run artifacts (${run_id}/config.resolved.yaml, ${run_id}/checkpoints/*); the trainer auto-generates one if the caller does not provide it.

The :data:`schema_version` field tracks the on-disk shape of config.resolved.yaml; bumps follow RFC-0014's MAJOR/MINOR rules on the config-resolution layer.
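The relationship between the aggregate object, its nested subsystem schemas, and the `${run_id}/config.resolved.yaml` artifact can be sketched roughly as follows. `TopLevelSketch`, `RuntimeSketch`, and `resolved_config_path` are hypothetical names, only a few fields from the signature above are included, and using `dataclasses.asdict` as the serialisation step is an assumption; the real writer is not shown in this page.

```python
from dataclasses import asdict, dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RuntimeSketch:
    """Stand-in for RuntimeConfig (Literal types simplified to str)."""

    backend: str = "torch"
    device: str = "cpu"


@dataclass(frozen=True)
class TopLevelSketch:
    """Stand-in for GenoLeWMConfig with most subsystems omitted."""

    run_id: str = "default"
    seed: int = 0
    runtime: RuntimeSketch = field(default_factory=RuntimeSketch)
    schema_version: str = "1.0.0"


def resolved_config_path(cfg: TopLevelSketch) -> PurePosixPath:
    """Build the ${run_id}/config.resolved.yaml path described above."""
    return PurePosixPath(cfg.run_id) / "config.resolved.yaml"


cfg = TopLevelSketch(run_id="exp-001")

# asdict recurses into nested dataclasses, giving a plain nested mapping
# that a YAML writer could dump (the writer itself is out of scope here).
payload = asdict(cfg)
path = resolved_config_path(cfg)
```

The sketch uses `default_factory` for the nested default; the rendered signature above shows instance defaults, and either form gives each config a fully populated subsystem tree.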

iter_top_level_field_names

iter_top_level_field_names() -> Iterator[str]

Yield the canonical top-level keys accepted by the loader.

The loader rejects any payload key not in this set via :class:`geno_lewm.errors.UnknownTopLevelKeyError`. Tests use this helper to assert the schema and the AC list stay in sync.

Source code in geno_lewm/config/schema.py
def iter_top_level_field_names() -> Iterator[str]:
    """Yield the canonical top-level keys accepted by the loader.

    The loader rejects any payload key not in this set via
    :class:`geno_lewm.errors.UnknownTopLevelKeyError`. Tests use this
    helper to assert the schema and the AC list stay in sync.
    """
    for f in fields(GenoLeWMConfig):
        yield f.name
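A minimal sketch of how a loader might use such a helper to reject unknown top-level keys is given below. The exception class is a local stand-in for geno_lewm.errors.UnknownTopLevelKeyError (its real base class and message format are not documented here), and `ConfigSketch`, `iter_field_names`, and `check_top_level_keys` are hypothetical names.

```python
from collections.abc import Iterator
from dataclasses import dataclass, fields


class UnknownTopLevelKeyError(KeyError):
    """Local stand-in; the real class lives in geno_lewm.errors."""


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in for GenoLeWMConfig with three of its fields."""

    run_id: str = "default"
    seed: int = 0
    deterministic: bool = False


def iter_field_names() -> Iterator[str]:
    """Same pattern as iter_top_level_field_names, over the stand-in class."""
    for f in fields(ConfigSketch):
        yield f.name


def check_top_level_keys(payload: dict) -> None:
    """Raise if the payload carries a key the schema does not declare."""
    allowed = set(iter_field_names())
    unknown = sorted(set(payload) - allowed)
    if unknown:
        raise UnknownTopLevelKeyError(f"unknown top-level key(s): {unknown}")


# A payload that only uses declared keys passes silently.
check_top_level_keys({"run_id": "exp-001", "seed": 7})

# A misspelled key is rejected instead of being ignored.
try:
    check_top_level_keys({"sead": 7})
    rejected = False
except UnknownTopLevelKeyError:
    rejected = True
```

Deriving the allowed set from `fields()` rather than a hand-written list means a field added to the dataclass is accepted by the check without a second edit.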