`geno_lewm.config.schema`¶

Typed configuration schema.

Frozen dataclasses for every subsystem. Each field has a default and a docstring explaining the field's purpose so the --explain flag can render it.

The top-level `GenoLeWMConfig` aggregates the subsystem schemas and is the single object that every CLI command resolves to. The loader at `geno_lewm.config.loader` is responsible for constructing it from YAML; the schema itself is pure data.

Phase 1 stays in lock-step with the configuration field names so the trainer (#44) can drop in without reshaping the config. Unimplemented subsystems (predictor, action encoder, planner) still carry default values here so the schema validates before those modules land.
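As a rough illustration of what "frozen dataclasses with defaults" implies in practice, the sketch below uses a stand-in class that copies the `PredictorConfig` signature documented on this page (it does not import the real module). The helper `try_mutate` is hypothetical and exists only to show that assignment is rejected, while `dataclasses.replace` produces a modified copy.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Stand-in that mirrors the PredictorConfig signature shown on this page.
# The real class lives in geno_lewm.config.schema; this copy is illustrative.
@dataclass(frozen=True)
class PredictorConfig:
    architecture: str = "cross_attention"
    n_layers: int = 6
    n_heads: int = 8
    d_state: int = 512
    d_action: int = 64
    dtype: str = "bf16"


def try_mutate(cfg: PredictorConfig) -> bool:
    """Hypothetical helper: return True only if in-place assignment succeeds."""
    try:
        cfg.n_layers = 12  # frozen dataclasses raise on attribute assignment
    except FrozenInstanceError:
        return False
    return True


default = PredictorConfig()             # every field falls back to its default
mutated = try_mutate(default)           # assignment is rejected
deeper = replace(default, n_layers=12)  # a new instance; `default` is untouched
```

Because instances cannot be changed in place, an override is always a new object, which keeps a resolved config safe to share between components.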

EncoderConfig `dataclass` ¶

EncoderConfig(model_id: str = 'HuggingFaceBio/Carbon-500M', revision: str = 'main@deadbeef', dtype: str = 'bf16', state_layer: int = 20, pool_type: str = 'centered_mean', pool_radius: int = 8, normalize: bool = True, state_contract_version: Literal['legacy_raw_v1', 'l2_normalized_v2'] = 'l2_normalized_v2', trust_remote_code: bool = False)

State encoder configuration (encoder contract).

The Phase 1 default is Carbon-500M with bf16 weights, pinned to a specific revision so the encoder hash committed to the manifest is reproducible.

PredictorConfig `dataclass` ¶

PredictorConfig(architecture: str = 'cross_attention', n_layers: int = 6, n_heads: int = 8, d_state: int = 512, d_action: int = 64, dtype: str = 'bf16')

Action-conditioned predictor (predictor contract).

ActionEncoderConfig `dataclass` ¶

ActionEncoderConfig(d_action: int = 64, max_len: int = 16, sub_encoders: tuple[str, ...] = ('snv', 'ins', 'del', 'mnv'))

Action encoder configuration (edit contract).
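Both `PredictorConfig` and `ActionEncoderConfig` carry a `d_action` field with the same default. A cross-field check along the following lines may be useful when overriding either one. This is a sketch under the assumption that the two widths are meant to agree; the page does not say whether the loader enforces this, and `action_widths_match` is a hypothetical helper.

```python
from dataclasses import dataclass


# Stand-ins that copy the signatures shown on this page (illustrative only).
@dataclass(frozen=True)
class PredictorConfig:
    architecture: str = "cross_attention"
    n_layers: int = 6
    n_heads: int = 8
    d_state: int = 512
    d_action: int = 64
    dtype: str = "bf16"


@dataclass(frozen=True)
class ActionEncoderConfig:
    d_action: int = 64
    max_len: int = 16
    sub_encoders: tuple[str, ...] = ("snv", "ins", "del", "mnv")


def action_widths_match(predictor: PredictorConfig, action: ActionEncoderConfig) -> bool:
    """Hypothetical check: the predictor consumes what the action encoder emits."""
    return predictor.d_action == action.d_action


defaults_ok = action_widths_match(PredictorConfig(), ActionEncoderConfig())
mismatch_ok = action_widths_match(PredictorConfig(d_action=32), ActionEncoderConfig())
```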

TrainingConfig `dataclass` ¶

TrainingConfig(max_steps: int = 50, collapse_log_every_steps: int = 500)

Real Carbon-backed training launch controls.

max_steps is the configured horizon for geno-lewm-train --carbon-train. Fixture smoke runs keep using the CLI --steps control so small release-plumbing tests cannot silently define the first real training horizon.
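The split between the configured horizon and the CLI control can be sketched as a small resolver. The function `resolve_horizon` and its parameters are hypothetical, and the requirement that fixture runs pass an explicit step count is an assumption made for the example; only the two sources of the step count come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in copying the TrainingConfig signature shown on this page.
@dataclass(frozen=True)
class TrainingConfig:
    max_steps: int = 50
    collapse_log_every_steps: int = 500


def resolve_horizon(
    cfg: TrainingConfig, carbon_train: bool, cli_steps: Optional[int] = None
) -> int:
    """Hypothetical resolver for the number of training steps.

    Carbon-backed runs read the horizon from the config; fixture smoke runs
    read it from the CLI value and never fall back to the config.
    """
    if carbon_train:
        return cfg.max_steps
    if cli_steps is None:
        # Assumption: a fixture run without an explicit step count is an error.
        raise ValueError("fixture smoke runs need an explicit --steps value")
    return cli_steps


real_horizon = resolve_horizon(TrainingConfig(max_steps=2000), carbon_train=True, cli_steps=5)
smoke_horizon = resolve_horizon(TrainingConfig(max_steps=2000), carbon_train=False, cli_steps=5)
```

Keeping the two paths separate means a small `--steps` value used in a plumbing test never leaks into a real training launch.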

OptimizerConfig `dataclass` ¶

OptimizerConfig(name: Literal['adamw', 'sgd-momentum'] = 'adamw', lr: float = 0.0003, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, grad_clip: float = 1.0, warmup_steps: int = 1000, schedule: Literal['wsd', 'cosine', 'constant'] = 'wsd')

Optimizer + learning-rate schedule (training contract).
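To make the `schedule` options concrete, here is a minimal learning-rate sketch. It assumes `'wsd'` stands for warmup-stable-decay with a linear decay tail, which this page does not confirm. The `total_steps` and `decay_fraction` parameters are hypothetical inputs that are not part of `OptimizerConfig`, and `lr_at` is not a function from the library.

```python
import math
from dataclasses import dataclass


# Stand-in with the schedule-related fields from the signature on this page.
@dataclass(frozen=True)
class OptimizerConfig:
    name: str = "adamw"
    lr: float = 0.0003
    warmup_steps: int = 1000
    schedule: str = "wsd"


def lr_at(step: int, cfg: OptimizerConfig, total_steps: int, decay_fraction: float = 0.1) -> float:
    """Hypothetical learning rate at a zero-based step."""
    if step < cfg.warmup_steps:
        # Linear warmup that reaches cfg.lr on the last warmup step.
        return cfg.lr * (step + 1) / cfg.warmup_steps
    if cfg.schedule == "constant":
        return cfg.lr
    if cfg.schedule == "cosine":
        span = max(1, total_steps - cfg.warmup_steps)
        progress = min(1.0, (step - cfg.warmup_steps) / span)
        return 0.5 * cfg.lr * (1.0 + math.cos(math.pi * progress))
    # Assumed 'wsd' shape: hold cfg.lr, then decay linearly to zero at the end.
    decay_steps = max(1, int(total_steps * decay_fraction))
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return cfg.lr
    return cfg.lr * max(0, total_steps - step) / decay_steps


cfg = OptimizerConfig()
warmup_first = lr_at(0, cfg, total_steps=10_000)
stable_mid = lr_at(5_000, cfg, total_steps=10_000)
decay_mid = lr_at(9_500, cfg, total_steps=10_000)
```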

DataConfig `dataclass` ¶

DataConfig(corpus_id: str = 'HuggingFaceBio/carbon-pretraining-corpus', corpus_revision: str = 'main@cafef00d', batch_size: int = 64, num_workers: int = 4, shuffle_buffer: int = 4096)

Data pipeline configuration (data-pipeline contract).

EvalConfig `dataclass` ¶

EvalConfig(benchmarks: tuple[str, ...] = ('clinvar_coding', 'clinvar_noncoding', 'rollout'), smoke_variants: int = 1000)

Evaluation harness (evaluation-suite contract).

ObservabilityConfig `dataclass` ¶

ObservabilityConfig(log_level: Literal['debug', 'info', 'warn', 'error'] = 'info', redaction_strict: bool = True, wandb_project: str | None = None)

Observability sinks (observability contract).

RuntimeConfig `dataclass` ¶

RuntimeConfig(backend: Literal['onnx', 'coreml', 'gguf', 'torch'] = 'torch', device: Literal['cpu', 'cuda', 'mps'] = 'cpu')

Runtime / deployment target (runtime contract).
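`Literal` annotations such as those on `backend` and `device` are not enforced by `dataclasses` at runtime, so some validation step has to compare values against the allowed options. The sketch below shows one way to do that with the standard `typing` helpers. `invalid_literal_fields` is a hypothetical function; how the actual loader validates is not described on this page.

```python
from dataclasses import dataclass, fields
from typing import List, Literal, get_args, get_origin, get_type_hints


# Stand-in copying the RuntimeConfig signature shown on this page.
@dataclass(frozen=True)
class RuntimeConfig:
    backend: Literal["onnx", "coreml", "gguf", "torch"] = "torch"
    device: Literal["cpu", "cuda", "mps"] = "cpu"


def invalid_literal_fields(cfg: object) -> List[str]:
    """Hypothetical validator: names of Literal fields holding a disallowed value."""
    hints = get_type_hints(type(cfg))
    bad = []
    for f in fields(cfg):
        hint = hints[f.name]
        # Only Literal[...] annotations carry an explicit set of allowed values.
        if get_origin(hint) is Literal and getattr(cfg, f.name) not in get_args(hint):
            bad.append(f.name)
    return bad


ok = invalid_literal_fields(RuntimeConfig())
bad = invalid_literal_fields(RuntimeConfig(device="tpu"))  # accepted by the dataclass itself
```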

GenoLeWMConfig `dataclass` ¶

GenoLeWMConfig(run_id: str = 'default', seed: int = 0, phase: Literal['phase1', 'phase2'] = 'phase1', encoder: EncoderConfig = EncoderConfig(), predictor: PredictorConfig = PredictorConfig(), action: ActionEncoderConfig = ActionEncoderConfig(), training: TrainingConfig = TrainingConfig(), optimizer: OptimizerConfig = OptimizerConfig(), data: DataConfig = DataConfig(), eval: EvalConfig = EvalConfig(), observability: ObservabilityConfig = ObservabilityConfig(), runtime: RuntimeConfig = RuntimeConfig(), deterministic: bool = False, schema_version: Literal['1.0.0', '1.1.0'] = '1.1.0')

Top-level configuration object.

Every CLI command resolves to one of these. run_id is the primary key for run artifacts (${run_id}/config.resolved.yaml, ${run_id}/checkpoints/*); the trainer auto-generates one if the caller does not provide it.
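The artifact layout keyed by `run_id` can be sketched with `pathlib`. The `root` directory and both helper names are assumptions for illustration; only the `${run_id}/config.resolved.yaml` and `${run_id}/checkpoints/*` shapes come from the description above. The auto-generated `run_id` format is not documented here, so the sketch takes the id as given.

```python
from pathlib import PurePosixPath


def resolved_config_path(run_id: str, root: str = "runs") -> PurePosixPath:
    """Hypothetical helper: where the resolved config for a run would live."""
    return PurePosixPath(root) / run_id / "config.resolved.yaml"


def checkpoint_dir(run_id: str, root: str = "runs") -> PurePosixPath:
    """Hypothetical helper: directory holding the checkpoints of a run."""
    return PurePosixPath(root) / run_id / "checkpoints"


config_path = resolved_config_path("default")
ckpt_dir = checkpoint_dir("default")
```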

The `schema_version` field tracks the on-disk shape of config.resolved.yaml; bumps follow the public API contract's MAJOR/MINOR rules on the config-resolution layer.

iter_top_level_field_names ¶

iter_top_level_field_names() -> Iterator[str]

Yield the canonical top-level keys accepted by the loader.

The loader rejects any payload key not in this set by raising `geno_lewm.errors.UnknownTopLevelKeyError`. Tests use this helper to assert the schema and the AC list stay in sync.

Source code in geno_lewm/config/schema.py

def iter_top_level_field_names() -> Iterator[str]:
    """Yield the canonical top-level keys accepted by the loader.

    The loader rejects any payload key not in this set via
    :class:`geno_lewm.errors.UnknownTopLevelKeyError`. Tests use this
    helper to assert the schema and the AC list stay in sync.
    """
    for f in fields(GenoLeWMConfig):
        yield f.name
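A loader-side check built on this helper might look like the sketch below. The dataclass and the exception are stand-ins. The real `UnknownTopLevelKeyError` lives in `geno_lewm.errors`, and its base class and message format are not documented here. `check_top_level_keys` is a hypothetical name, and the field types are loosened to `Any` for brevity.

```python
from dataclasses import dataclass, fields
from typing import Any, Iterator, Mapping


class UnknownTopLevelKeyError(KeyError):
    """Stand-in for geno_lewm.errors.UnknownTopLevelKeyError (base class assumed)."""


# Stand-in carrying the top-level field names from the GenoLeWMConfig signature.
@dataclass(frozen=True)
class GenoLeWMConfig:
    run_id: str = "default"
    seed: int = 0
    phase: str = "phase1"
    encoder: Any = None
    predictor: Any = None
    action: Any = None
    training: Any = None
    optimizer: Any = None
    data: Any = None
    eval: Any = None
    observability: Any = None
    runtime: Any = None
    deterministic: bool = False
    schema_version: str = "1.1.0"


def iter_top_level_field_names() -> Iterator[str]:
    for f in fields(GenoLeWMConfig):
        yield f.name


def check_top_level_keys(payload: Mapping[str, Any]) -> None:
    """Hypothetical check: raise if the payload has keys the schema does not know."""
    allowed = set(iter_top_level_field_names())
    unknown = sorted(set(payload) - allowed)
    if unknown:
        raise UnknownTopLevelKeyError(", ".join(unknown))


names = list(iter_top_level_field_names())
check_top_level_keys({"seed": 1, "encoder": {"dtype": "bf16"}})  # passes silently
```

Deriving the allowed set from `fields()` means a new top-level field is accepted as soon as it is added to the dataclass, with no second list to keep in sync.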
