geno_lewm.config

config

Typed configuration surface (RFC-0017).

The schema is declared as nested frozen dataclasses (RFC-0017 §3.2 left the choice between Pydantic v2 and dataclasses open; we chose dataclasses to keep base runtime deps minimal — the package's only new runtime dep introduced by this module is :mod:yaml).
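
As a rough illustration of what "nested frozen dataclasses" implies for callers, the sketch below uses two hypothetical stand-in classes (`EncoderSketch` and `ConfigSketch`, not the real schema classes) and reproduces only a few of the documented fields. It assumes nothing about the package beyond the standard `dataclasses` behaviour.

```python
import dataclasses
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EncoderSketch:
    # Hypothetical stand-in for EncoderConfig; only two fields are reproduced.
    dtype: str = "bf16"
    state_layer: int = 20


@dataclass(frozen=True)
class ConfigSketch:
    # Hypothetical stand-in for GenoLeWMConfig.
    run_id: str = "default"
    seed: int = 0
    encoder: EncoderSketch = field(default_factory=EncoderSketch)


cfg = ConfigSketch()

# Frozen dataclasses reject attribute assignment after construction.
try:
    cfg.seed = 1
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# A change is expressed by building a new tree with dataclasses.replace.
cfg_fp32 = dataclasses.replace(
    cfg, encoder=dataclasses.replace(cfg.encoder, dtype="fp32")
)
```

Because the original object is never modified, a resolved config can be passed around without defensive copies; overrides produce a new object instead.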

Public surface:

  • :class:GenoLeWMConfig — top-level dataclass.
  • :class:EncoderConfig, :class:PredictorConfig, :class:ActionEncoderConfig, :class:OptimizerConfig, :class:DataConfig, :class:EvalConfig, :class:ObservabilityConfig, :class:RuntimeConfig, :class:TrainingConfig — per-subsystem schemas.
  • :func:load_config — load YAML + validate; raises :class:UnknownTopLevelKeyError on unknown top-level keys (RFC-0017 §3.3).
  • :func:write_resolved_config — emit the resolved config as canonical YAML so the run directory is auditable (RFC-0017 §3.5).
  • :func:config_to_dict — pure dict view of a config tree (used by the manifest writer and the --print-config flag in PR #29).
  • :func:describe_field — schema introspection for the --explain flag (PR #29). Returns the field's docstring snippet, type, and default value.
  • :data:DEFAULTS_DIR — directory containing the canonical YAML templates for train, score, eval, and plan commands.

ActionEncoderConfig dataclass

ActionEncoderConfig(d_action: int = 64, max_len: int = 16, sub_encoders: tuple[str, ...] = ('snv', 'ins', 'del', 'mnv'))

Action encoder configuration (RFC-0003 §3.4).

DataConfig dataclass

DataConfig(corpus_id: str = 'HuggingFaceBio/carbon-pretraining-corpus', corpus_revision: str = 'main@cafef00d', batch_size: int = 64, num_workers: int = 4, shuffle_buffer: int = 4096)

Data pipeline configuration (RFC-0006).

EncoderConfig dataclass

EncoderConfig(model_id: str = 'HuggingFaceBio/Carbon-500M', revision: str = 'main@deadbeef', dtype: str = 'bf16', state_layer: int = 20, pool_type: str = 'centered_mean', pool_radius: int = 8, normalize: bool = True, trust_remote_code: bool = False)

State encoder configuration (RFC-0002 §3.1, §3.8).

The Phase 1 default is Carbon-500M with bf16 weights, pinned to a specific revision so the encoder hash committed to the manifest is reproducible.

EvalConfig dataclass

EvalConfig(benchmarks: tuple[str, ...] = ('clinvar_coding', 'clinvar_noncoding', 'rollout'), smoke_variants: int = 1000)

Evaluation harness (RFC-0007).

GenoLeWMConfig dataclass

GenoLeWMConfig(run_id: str = 'default', seed: int = 0, phase: Literal['phase1', 'phase2'] = 'phase1', encoder: EncoderConfig = EncoderConfig(), predictor: PredictorConfig = PredictorConfig(), action: ActionEncoderConfig = ActionEncoderConfig(), training: TrainingConfig = TrainingConfig(), optimizer: OptimizerConfig = OptimizerConfig(), data: DataConfig = DataConfig(), eval: EvalConfig = EvalConfig(), observability: ObservabilityConfig = ObservabilityConfig(), runtime: RuntimeConfig = RuntimeConfig(), deterministic: bool = False, schema_version: str = '1.0.0')

Top-level configuration object.

Every CLI command resolves to one of these. run_id is the primary key for run artifacts (${run_id}/config.resolved.yaml, ${run_id}/checkpoints/*); the trainer auto-generates one if the caller does not provide it.

The :data:schema_version field tracks the on-disk shape of config.resolved.yaml — bumps follow RFC-0014's MAJOR/MINOR rules on the config-resolution layer.

ObservabilityConfig dataclass

ObservabilityConfig(log_level: Literal['debug', 'info', 'warn', 'error'] = 'info', redaction_strict: bool = True, wandb_project: str | None = None)

Observability sinks (RFC-0013).

OptimizerConfig dataclass

OptimizerConfig(name: Literal['adamw', 'sgd-momentum'] = 'adamw', lr: float = 0.0003, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, grad_clip: float = 1.0, warmup_steps: int = 1000, schedule: Literal['wsd', 'cosine', 'constant'] = 'wsd')

Optimizer + learning-rate schedule (RFC-0005).

PredictorConfig dataclass

PredictorConfig(architecture: str = 'cross_attention', n_layers: int = 6, n_heads: int = 8, d_state: int = 512, d_action: int = 64, dtype: str = 'bf16')

Action-conditioned predictor (RFC-0004 §3.1).

RuntimeConfig dataclass

RuntimeConfig(backend: Literal['onnx', 'coreml', 'gguf', 'torch'] = 'torch', device: Literal['cpu', 'cuda', 'mps'] = 'cpu')

Runtime / deployment target (RFC-0010).

TrainingConfig dataclass

TrainingConfig(max_steps: int = 50, collapse_log_every_steps: int = 500)

Real Carbon-backed training launch controls.

max_steps is the configured horizon for geno-lewm-train --carbon-train. Fixture smoke runs keep using the CLI --steps control so small release-plumbing tests cannot silently define the first real training horizon.

config_to_dict

config_to_dict(cfg: GenoLeWMConfig) -> dict[str, Any]

Return a plain dict view of cfg for serialization.

Source code in geno_lewm/config/loader.py
def config_to_dict(cfg: GenoLeWMConfig) -> dict[str, Any]:
    """Return a plain dict view of ``cfg`` for serialization."""
    result = _asdict_with_tuples(cfg)
    assert isinstance(result, dict)
    return result
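
The private helper `_asdict_with_tuples` is not shown on this page, so its exact handling of tuples is unknown. The sketch below is one plausible way to build a plain dict view of a nested dataclass tree; it assumes tuples are emitted as lists so the result contains only basic container types. `ActionSketch`, `ConfigSketch`, and `to_plain` are hypothetical names.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any


@dataclass(frozen=True)
class ActionSketch:
    # Hypothetical stand-in for ActionEncoderConfig.
    d_action: int = 64
    sub_encoders: tuple = ("snv", "ins", "del", "mnv")


@dataclass(frozen=True)
class ConfigSketch:
    # Hypothetical stand-in for GenoLeWMConfig.
    run_id: str = "default"
    action: ActionSketch = field(default_factory=ActionSketch)


def to_plain(value: Any) -> Any:
    """Recursively convert dataclass instances to dicts and tuples to lists."""
    if is_dataclass(value) and not isinstance(value, type):
        return {f.name: to_plain(getattr(value, f.name)) for f in fields(value)}
    if isinstance(value, (tuple, list)):
        # Assumption: sequences are serialized as lists.
        return [to_plain(v) for v in value]
    if isinstance(value, dict):
        return {k: to_plain(v) for k, v in value.items()}
    return value


view = to_plain(ConfigSketch())
```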

describe_field

describe_field(dotted_key: str) -> dict[str, Any]

Return {type, default, doc} for dotted_key (e.g. encoder.dtype).

Raises :class:MissingConfigError if the key is not in the schema, and :class:InputError if the key is empty.

Source code in geno_lewm/config/loader.py
def describe_field(dotted_key: str) -> dict[str, Any]:
    """Return ``{type, default, doc}`` for ``dotted_key`` (e.g. ``encoder.dtype``).

    Raises :class:`MissingConfigError` if the key is not in the schema,
    and :class:`InputError` if the key is empty.
    """
    parts = dotted_key.split(".")
    if not parts or not parts[0]:
        raise InputError("--explain key must not be empty", details={"key": dotted_key})

    cls: type = GenoLeWMConfig
    field_obj: dataclasses.Field[Any] | None = None
    parent_doc = cls.__doc__ or ""

    for i, part in enumerate(parts):
        if not is_dataclass(cls):
            raise MissingConfigError(
                "--explain: path leaves the schema before resolving",
                details={"key": dotted_key, "where": ".".join(parts[: i + 1])},
            )
        try:
            field_obj = next(f for f in fields(cls) if f.name == part)
        except StopIteration as exc:
            known = [f.name for f in fields(cls)]
            raise MissingConfigError(
                "--explain: key not found in schema",
                details={"key": dotted_key, "where": part, "known": sorted(known)},
            ) from exc
        next_type = get_type_hints(cls).get(part)
        if is_dataclass(field_obj.type) and isinstance(field_obj.type, type):
            cls = field_obj.type
            parent_doc = cls.__doc__ or ""
            continue
        if next_type is not None and isinstance(next_type, type) and is_dataclass(next_type):
            cls = next_type
            parent_doc = cls.__doc__ or ""
            continue
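        # Leaf field reached; any remaining segments cannot resolve.
        if i != len(parts) - 1:
            raise MissingConfigError(
                "--explain: path leaves the schema before resolving",
                details={"key": dotted_key, "where": ".".join(parts[: i + 1])},
            )
        return _format_field_info(field_obj, parent_doc=parent_doc, type_hint=next_type)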

    if field_obj is None:  # pragma: no cover - guarded above
        raise InputError("--explain key did not resolve to a field", details={"key": dotted_key})
    return _format_field_info(field_obj, parent_doc=parent_doc, type_hint=None)
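
The dotted-key walk can be tried in isolation with a minimal sketch. It uses hypothetical stand-in classes and plain `KeyError` instead of the package's error types, and it makes two choices of its own: segments that continue past a leaf are rejected, and so is a key that stops at a subsystem.

```python
from dataclasses import MISSING, dataclass, field, fields, is_dataclass
from typing import Any, Dict, get_type_hints


@dataclass(frozen=True)
class EncoderSketch:
    """Hypothetical stand-in for EncoderConfig."""

    dtype: str = "bf16"
    state_layer: int = 20


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for GenoLeWMConfig."""

    seed: int = 0
    encoder: EncoderSketch = field(default_factory=EncoderSketch)


def describe(dotted_key: str) -> Dict[str, Any]:
    """Resolve ``dotted_key`` against ConfigSketch and describe the leaf."""
    parts = dotted_key.split(".")
    cls: type = ConfigSketch
    for i, part in enumerate(parts):
        by_name = {f.name: f for f in fields(cls)}
        if part not in by_name:
            raise KeyError(f"{part!r} not found; known: {sorted(by_name)}")
        hint = get_type_hints(cls)[part]
        if isinstance(hint, type) and is_dataclass(hint):
            cls = hint  # nested subsystem: descend one level
            continue
        if i != len(parts) - 1:
            # Segments remain after a leaf; this sketch treats that as an error.
            raise KeyError(f"{'.'.join(parts[: i + 1])!r} is a leaf field")
        leaf = by_name[part]
        default = None if leaf.default is MISSING else leaf.default
        return {"type": hint, "default": default, "doc": (cls.__doc__ or "").strip()}
    # The loop ended on a subsystem; this sketch only describes leaves.
    raise KeyError(f"{dotted_key!r} names a subsystem, not a leaf field")


def is_rejected(key: str) -> bool:
    """Return True when ``describe`` raises KeyError for ``key``."""
    try:
        describe(key)
    except KeyError:
        return True
    return False


info = describe("encoder.dtype")
```

Resolving types through `get_type_hints` rather than `Field.type` keeps the walk working when annotations are stored as strings.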

load_config

load_config(source: Path | str | Mapping[str, Any]) -> GenoLeWMConfig

Load + validate a config payload; return a frozen :class:GenoLeWMConfig.

source may be:

  • A :class:Path (or str) — read the file as YAML.
  • A :class:Mapping — treat it as the already-parsed payload (used by --set override merging in PR #29 and by the unit tests).

Validation:

  • Unknown top-level keys → :class:UnknownTopLevelKeyError.
  • Missing required subsystem keys → :class:MissingConfigError.
  • Wrong value type on any field → :class:ConfigError.

Source code in geno_lewm/config/loader.py
def load_config(source: Path | str | Mapping[str, Any]) -> GenoLeWMConfig:
    """Load + validate a config payload; return a frozen :class:`GenoLeWMConfig`.

    ``source`` may be:

    * A :class:`Path` (or ``str``) — read the file as YAML.
    * A :class:`Mapping` — treat it as the already-parsed payload (used
      by ``--set`` override merging in PR #29 and by the unit tests).

    Validation:

    * Unknown top-level keys → :class:`UnknownTopLevelKeyError`.
    * Missing required subsystem keys → :class:`MissingConfigError`.
    * Wrong value type on any field → :class:`ConfigError`.
    """
    if isinstance(source, Mapping):
        payload: Any = source
    elif isinstance(source, str | Path):
        payload = _resolve_payload(source)
    else:
        raise InputError(
            "config source must be a path, str, or mapping",
            details={"got": type(source).__name__},
        )
    if not isinstance(payload, Mapping):
        raise InputError(
            "config payload must be a mapping at the top level",
            details={"got": type(payload).__name__},
        )
    return _build_top_level(dict(payload))
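
`_build_top_level` is not shown here. The following sketch illustrates two of the validation rules listed above (unknown top-level keys and wrong value types) on an already-parsed mapping. All class and function names are hypothetical stand-ins, and missing keys simply fall back to dataclass defaults, which is a simplification and not a statement about how the real loader treats missing subsystem keys.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Mapping


class UnknownKeySketchError(ValueError):
    """Hypothetical stand-in for UnknownTopLevelKeyError."""


class TypeSketchError(TypeError):
    """Hypothetical stand-in for ConfigError."""


@dataclass(frozen=True)
class DataSketch:
    batch_size: int = 64


@dataclass(frozen=True)
class ConfigSketch:
    run_id: str = "default"
    seed: int = 0
    data: DataSketch = field(default_factory=DataSketch)


def build(payload: Mapping[str, Any]) -> ConfigSketch:
    """Validate a parsed payload and build a frozen config tree."""
    known = {f.name for f in fields(ConfigSketch)}
    unknown = sorted(set(payload) - known)
    if unknown:
        raise UnknownKeySketchError(f"unknown top-level keys: {unknown}")
    kwargs = dict(payload)
    if "seed" in kwargs and not isinstance(kwargs["seed"], int):
        raise TypeSketchError("seed must be an int")
    if "data" in kwargs:
        if not isinstance(kwargs["data"], Mapping):
            raise TypeSketchError("data must be a mapping")
        kwargs["data"] = DataSketch(**kwargs["data"])
    return ConfigSketch(**kwargs)


def outcome(payload: Mapping[str, Any]) -> str:
    """Return 'ok' or the name of the validation error that was raised."""
    try:
        build(payload)
        return "ok"
    except (UnknownKeySketchError, TypeSketchError) as exc:
        return type(exc).__name__


cfg = build({"run_id": "r1", "data": {"batch_size": 8}})
```

Rejecting unknown top-level keys up front means a misspelled section name fails loudly instead of being silently ignored.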

load_default

load_default(name: str) -> GenoLeWMConfig

Shorthand for load_config(DEFAULTS_DIR / f"{name}.yaml").

Accepts the documented command names (train / score / eval / plan). Raises :class:MissingConfigError if the YAML template is missing.

Source code in geno_lewm/config/loader.py
def load_default(name: str) -> GenoLeWMConfig:
    """Shorthand for ``load_config(DEFAULTS_DIR / f"{name}.yaml")``.

    Accepts the documented command names (``train`` / ``score`` /
    ``eval`` / ``plan``). Raises :class:`MissingConfigError` if the
    YAML template is missing.
    """
    target = DEFAULTS_DIR / f"{name}.yaml"
    if not target.is_file():
        raise MissingConfigError(
            f"no default config for command {name!r}",
            details={"path": str(target), "known": sorted(_known_defaults())},
        )
    return load_config(target)
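
The lookup pattern can be exercised against a temporary directory. In this sketch `known_defaults` is a guess at what the private `_known_defaults()` helper might return (the stems of the `*.yaml` files in the directory), and `FileNotFoundError` stands in for `MissingConfigError`; both are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List


def known_defaults(defaults_dir: Path) -> List[str]:
    # Assumption: the known command names are the stems of the YAML templates.
    return sorted(p.stem for p in defaults_dir.glob("*.yaml"))


def resolve_default(defaults_dir: Path, name: str) -> Path:
    """Return the template path for ``name`` or raise with the known names."""
    target = defaults_dir / f"{name}.yaml"
    if not target.is_file():
        raise FileNotFoundError(
            f"no default config for command {name!r}; "
            f"known: {known_defaults(defaults_dir)}"
        )
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for command in ("train", "score", "eval", "plan"):
        (root / f"{command}.yaml").write_text("run_id: default\n", encoding="utf-8")
    found = resolve_default(root, "train").name
    try:
        resolve_default(root, "deploy")
        message = ""
    except FileNotFoundError as exc:
        message = str(exc)
```

Listing the known names in the error gives the caller an immediate hint about which command names are valid.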

write_resolved_config

write_resolved_config(cfg: GenoLeWMConfig, path: Path | str) -> Path

Write cfg as canonical YAML to path; return the absolute path.

Canonical = sort_keys=True, default_flow_style=False, no anchors. The result hashes byte-stably so the manifest's training.config_file hash matches between machines.

Source code in geno_lewm/config/loader.py
def write_resolved_config(cfg: GenoLeWMConfig, path: Path | str) -> Path:
    """Write ``cfg`` as canonical YAML to ``path``; return the absolute path.

    Canonical = ``sort_keys=True``, ``default_flow_style=False``, no
    anchors. The result hashes byte-stably so the manifest's
    ``training.config_file`` hash matches between machines.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    text = yaml.safe_dump(
        config_to_dict(cfg),
        sort_keys=True,
        default_flow_style=False,
        allow_unicode=True,
    )
    target.write_text(text, encoding="utf-8")
    return target.resolve()
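
The reason sorted keys matter for hashing can be shown with the standard library alone. The sketch below substitutes `json.dumps(sort_keys=True)` for `yaml.safe_dump` so that it runs without PyYAML, and it assumes SHA-256 as the digest; the hash the manifest actually uses is not stated on this page.

```python
import hashlib
import json


def canonical_bytes(tree: dict) -> bytes:
    # Stand-in for yaml.safe_dump(..., sort_keys=True); json is used here only
    # so the sketch depends on nothing outside the standard library.
    text = json.dumps(tree, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def digest(tree: dict) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes of the canonical text.
    return hashlib.sha256(canonical_bytes(tree)).hexdigest()


# The same settings, inserted in different orders.
a = {"seed": 0, "run_id": "default", "encoder": {"dtype": "bf16", "state_layer": 20}}
b = {"encoder": {"state_layer": 20, "dtype": "bf16"}, "run_id": "default", "seed": 0}

same = digest(a) == digest(b)
changed = digest(a) != digest({**a, "seed": 1})
```

With key sorting, insertion order no longer affects the serialized bytes, so two machines that resolve the same settings produce the same digest.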

iter_top_level_field_names

iter_top_level_field_names() -> Iterator[str]

Yield the canonical top-level keys accepted by the loader.

The loader rejects any payload key not in this set via :class:geno_lewm.errors.UnknownTopLevelKeyError. Tests use this helper to assert the schema and the AC list stay in sync.

Source code in geno_lewm/config/schema.py
def iter_top_level_field_names() -> Iterator[str]:
    """Yield the canonical top-level keys accepted by the loader.

    The loader rejects any payload key not in this set via
    :class:`geno_lewm.errors.UnknownTopLevelKeyError`. Tests use this
    helper to assert the schema and the AC list stay in sync.
    """
    for f in fields(GenoLeWMConfig):
        yield f.name
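
A sync check of the kind described above might look like the following sketch. `ConfigSketch` is a hypothetical three-field stand-in for the real schema, and `EXPECTED_KEYS` is a hypothetical stand-in for the AC list, which is not reproduced on this page.

```python
from dataclasses import dataclass, fields
from typing import Iterator


@dataclass(frozen=True)
class ConfigSketch:
    # Hypothetical stand-in for GenoLeWMConfig with three of its fields.
    run_id: str = "default"
    seed: int = 0
    phase: str = "phase1"


# Hypothetical list of accepted keys kept alongside the tests.
EXPECTED_KEYS = ("run_id", "seed", "phase")


def iter_names() -> Iterator[str]:
    """Yield top-level field names in declaration order."""
    for f in fields(ConfigSketch):
        yield f.name


names = list(iter_names())
in_sync = set(names) == set(EXPECTED_KEYS)
```

Deriving the accepted keys from `fields()` means a field added to the schema is accepted by the loader automatically, and the comparison fails until the list is updated to match.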