geno_lewm.config¶
config
¶
Typed configuration surface (RFC-0017).
The schema is declared as nested frozen dataclasses. RFC-0017 §3.2 left
the choice between Pydantic v2 and dataclasses open; we chose
dataclasses to keep base runtime dependencies minimal, so the only new
runtime dependency this module introduces is :mod:yaml.
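As a rough illustration of what "nested frozen dataclasses" implies for callers, the sketch below uses two hypothetical stand-in classes (EncoderSketch and TopLevelSketch are not part of the package). It assumes overrides are applied by building a new tree with dataclasses.replace, which is one common pattern and not necessarily how the loader does it.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class EncoderSketch:
    # Hypothetical stand-in for a per-subsystem schema; two fields only.
    dtype: str = "bf16"
    normalize: bool = True


@dataclass(frozen=True)
class TopLevelSketch:
    # Hypothetical stand-in for the top-level schema.
    run_id: str = "default"
    seed: int = 0
    encoder: EncoderSketch = field(default_factory=EncoderSketch)


cfg = TopLevelSketch()

# Frozen instances reject attribute assignment.
try:
    cfg.seed = 1
    mutated = True
except FrozenInstanceError:
    mutated = False

# An override produces a new tree and leaves the original untouched.
fp32_cfg = replace(cfg, encoder=replace(cfg.encoder, dtype="fp32"))
```

Because the tree is immutable, a resolved config can be shared between the trainer and the manifest writer without defensive copies.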
Public surface:
- :class:GenoLeWMConfig is the top-level dataclass.
- :class:EncoderConfig, :class:PredictorConfig, :class:ActionEncoderConfig, :class:OptimizerConfig, :class:DataConfig, :class:EvalConfig, :class:ObservabilityConfig, :class:RuntimeConfig, and :class:TrainingConfig are the per-subsystem schemas.
- :func:load_config loads YAML and validates it; raises :class:UnknownTopLevelKeyError on unknown top-level keys (RFC-0017 §3.3).
- :func:write_resolved_config emits the resolved config as canonical YAML so the run directory is auditable (RFC-0017 §3.5).
- :func:config_to_dict returns a pure dict view of a config tree (used by the manifest writer and the --print-config flag in PR #29).
- :func:describe_field provides schema introspection for the --explain flag (PR #29). Returns the field's docstring snippet, type, and default value.
- :data:DEFAULTS_DIR is the directory containing the canonical YAML templates for the train, score, eval, and plan commands.
ActionEncoderConfig
dataclass
¶
ActionEncoderConfig(d_action: int = 64, max_len: int = 16, sub_encoders: tuple[str, ...] = ('snv', 'ins', 'del', 'mnv'))
Action encoder configuration (RFC-0003 §3.4).
DataConfig
dataclass
¶
DataConfig(corpus_id: str = 'HuggingFaceBio/carbon-pretraining-corpus', corpus_revision: str = 'main@cafef00d', batch_size: int = 64, num_workers: int = 4, shuffle_buffer: int = 4096)
Data pipeline configuration (RFC-0006).
EncoderConfig
dataclass
¶
EncoderConfig(model_id: str = 'HuggingFaceBio/Carbon-500M', revision: str = 'main@deadbeef', dtype: str = 'bf16', state_layer: int = 20, pool_type: str = 'centered_mean', pool_radius: int = 8, normalize: bool = True, trust_remote_code: bool = False)
State encoder configuration (RFC-0002 §3.1, §3.8).
The Phase 1 default is Carbon-500M with bf16 weights, pinned to a specific revision so the encoder hash committed to the manifest is reproducible.
EvalConfig
dataclass
¶
EvalConfig(benchmarks: tuple[str, ...] = ('clinvar_coding', 'clinvar_noncoding', 'rollout'), smoke_variants: int = 1000)
Evaluation harness (RFC-0007).
GenoLeWMConfig
dataclass
¶
GenoLeWMConfig(run_id: str = 'default', seed: int = 0, phase: Literal['phase1', 'phase2'] = 'phase1', encoder: EncoderConfig = EncoderConfig(), predictor: PredictorConfig = PredictorConfig(), action: ActionEncoderConfig = ActionEncoderConfig(), training: TrainingConfig = TrainingConfig(), optimizer: OptimizerConfig = OptimizerConfig(), data: DataConfig = DataConfig(), eval: EvalConfig = EvalConfig(), observability: ObservabilityConfig = ObservabilityConfig(), runtime: RuntimeConfig = RuntimeConfig(), deterministic: bool = False, schema_version: str = '1.0.0')
Top-level configuration object.
Every CLI command resolves to one of these. run_id is the
primary key for run artifacts (${run_id}/config.resolved.yaml,
${run_id}/checkpoints/*); the trainer auto-generates one if the
caller does not provide it.
The :data:schema_version field tracks the on-disk shape of
config.resolved.yaml — bumps follow RFC-0014's MAJOR/MINOR
rules on the config-resolution layer.
ObservabilityConfig
dataclass
¶
ObservabilityConfig(log_level: Literal['debug', 'info', 'warn', 'error'] = 'info', redaction_strict: bool = True, wandb_project: str | None = None)
Observability sinks (RFC-0013).
OptimizerConfig
dataclass
¶
OptimizerConfig(name: Literal['adamw', 'sgd-momentum'] = 'adamw', lr: float = 0.0003, beta1: float = 0.9, beta2: float = 0.95, weight_decay: float = 0.1, grad_clip: float = 1.0, warmup_steps: int = 1000, schedule: Literal['wsd', 'cosine', 'constant'] = 'wsd')
Optimizer + learning-rate schedule (RFC-0005).
PredictorConfig
dataclass
¶
PredictorConfig(architecture: str = 'cross_attention', n_layers: int = 6, n_heads: int = 8, d_state: int = 512, d_action: int = 64, dtype: str = 'bf16')
Action-conditioned predictor (RFC-0004 §3.1).
RuntimeConfig
dataclass
¶
RuntimeConfig(backend: Literal['onnx', 'coreml', 'gguf', 'torch'] = 'torch', device: Literal['cpu', 'cuda', 'mps'] = 'cpu')
Runtime / deployment target (RFC-0010).
TrainingConfig
dataclass
¶
Real Carbon-backed training launch controls.
max_steps is the configured horizon for geno-lewm-train
--carbon-train. Fixture smoke runs keep using the CLI --steps
control so small release-plumbing tests cannot silently define the
first real training horizon.
config_to_dict
¶
Return a plain dict view of cfg for serialization.
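A minimal sketch of one way to build such a plain dict view, using only the standard library. The classes and the function name config_to_dict_sketch are hypothetical; converting tuples to lists is an assumption made here so the result contains only serializer-friendly types, and the real function may behave differently.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class EvalSketch:
    # Hypothetical stand-in for a subsystem schema.
    benchmarks: tuple[str, ...] = ("clinvar_coding", "rollout")
    smoke_variants: int = 1000


@dataclass(frozen=True)
class RootSketch:
    # Hypothetical stand-in for the top-level schema.
    run_id: str = "default"
    eval: EvalSketch = field(default_factory=EvalSketch)


def _to_plain(value: Any) -> Any:
    """Recursively turn tuples into lists; leave scalars unchanged."""
    if isinstance(value, dict):
        return {key: _to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_to_plain(item) for item in value]
    return value


def config_to_dict_sketch(cfg: Any) -> dict[str, Any]:
    # asdict already recurses into nested dataclasses.
    return _to_plain(asdict(cfg))


plain = config_to_dict_sketch(RootSketch())
```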
describe_field
¶
Return {type, default, doc} for dotted_key (e.g. encoder.dtype).
Raises :class:MissingConfigError if the key is not in the schema.
Source code in geno_lewm/config/loader.py
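The dotted-key lookup described above can be sketched with dataclasses.fields. Everything named here is hypothetical: the classes, MissingKeySketch (a stand-in for MissingConfigError), and in particular the "doc" metadata key, since the text does not say where the real implementation finds the docstring snippet.

```python
from dataclasses import MISSING, dataclass, field, fields, is_dataclass
from typing import Any


class MissingKeySketch(KeyError):
    """Hypothetical stand-in for the package's MissingConfigError."""


@dataclass(frozen=True)
class EncoderSketch:
    # Storing docs under a "doc" metadata key is an assumption of this sketch.
    dtype: str = field(default="bf16", metadata={"doc": "Weight dtype."})
    state_layer: int = field(default=20, metadata={"doc": "Layer to read."})


@dataclass(frozen=True)
class RootSketch:
    seed: int = field(default=0, metadata={"doc": "Global RNG seed."})
    encoder: EncoderSketch = field(
        default_factory=EncoderSketch, metadata={"doc": "Encoder settings."}
    )


def describe_field_sketch(schema: type, dotted_key: str) -> dict[str, Any]:
    """Walk the schema along dotted_key and describe the final field."""
    current: Any = schema
    found = None
    for part in dotted_key.split("."):
        # A non-dataclass type has no sub-fields to descend into.
        if not is_dataclass(current):
            raise MissingKeySketch(dotted_key)
        by_name = {f.name: f for f in fields(current)}
        if part not in by_name:
            raise MissingKeySketch(dotted_key)
        found = by_name[part]
        current = found.type
    if found.default is not MISSING:
        default = found.default
    elif found.default_factory is not MISSING:
        default = found.default_factory()
    else:
        default = None
    return {
        "type": getattr(found.type, "__name__", str(found.type)),
        "default": default,
        "doc": found.metadata.get("doc", ""),
    }


info = describe_field_sketch(RootSketch, "encoder.dtype")
```

The sketch relies on annotations being real classes; with string annotations (for example under from __future__ import annotations) the field types would need to be resolved first.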
load_config
¶
Load + validate a config payload; return a frozen :class:GenoLeWMConfig.
source may be:
- A :class:Path (or str): read the file as YAML.
- A :class:Mapping: treat it as the already-parsed payload (used by --set override merging in PR #29 and by the unit tests).
Validation:
- Unknown top-level keys → :class:UnknownTopLevelKeyError.
- Missing required subsystem keys → :class:MissingConfigError.
- Wrong value type on any field → :class:ConfigError.
Source code in geno_lewm/config/loader.py
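Two of the validation rules above, unknown top-level keys and wrong value types, can be sketched for the Mapping input path as follows. All names are hypothetical stand-ins. The sketch departs from the documented behavior in one respect, which is an assumption made for brevity: missing keys fall back to dataclass defaults and no missing-key error is raised. It also skips YAML file reading.

```python
from collections.abc import Mapping
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any


class ConfigErrorSketch(ValueError):
    """Hypothetical stand-in for ConfigError."""


class UnknownKeySketch(ConfigErrorSketch):
    """Hypothetical stand-in for UnknownTopLevelKeyError."""


@dataclass(frozen=True)
class OptimizerSketch:
    lr: float = 0.0003
    warmup_steps: int = 1000


@dataclass(frozen=True)
class RootSketch:
    seed: int = 0
    optimizer: OptimizerSketch = field(default_factory=OptimizerSketch)


def _build(schema: type, payload: Mapping[str, Any]) -> Any:
    kwargs = {}
    for f in fields(schema):
        if f.name not in payload:
            continue  # assumption: absent keys use the dataclass default
        value = payload[f.name]
        if is_dataclass(f.type):
            if not isinstance(value, Mapping):
                raise ConfigErrorSketch(f"{f.name}: expected a mapping")
            value = _build(f.type, value)
        else:
            # Accept ints for float fields, but never a bool for a number.
            expected = (int, float) if f.type is float else f.type
            stray_bool = isinstance(value, bool) and f.type is not bool
            if stray_bool or not isinstance(value, expected):
                raise ConfigErrorSketch(f"{f.name}: expected {f.type.__name__}")
        kwargs[f.name] = value
    return schema(**kwargs)


def load_config_sketch(payload: Mapping[str, Any]) -> RootSketch:
    allowed = {f.name for f in fields(RootSketch)}
    unknown = sorted(set(payload) - allowed)
    if unknown:
        raise UnknownKeySketch(f"unknown top-level keys: {unknown}")
    return _build(RootSketch, payload)


cfg = load_config_sketch({"seed": 7, "optimizer": {"lr": 0.001}})
```

Rejecting unknown keys only at the top level mirrors the rule stated in the text; whether nested unknown keys are also rejected is not specified there, and this sketch ignores them.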
load_default
¶
Shorthand for load_config(DEFAULTS_DIR / f"{name}.yaml").
Accepts the documented command names (train / score /
eval / plan). Raises :class:MissingConfigError if the
YAML template is missing.
Source code in geno_lewm/config/loader.py
write_resolved_config
¶
Write cfg as canonical YAML to path; return the absolute path.
Canonical means sort_keys=True, default_flow_style=False, and no
anchors. The output is byte-stable, so the manifest's
training.config_file hash matches between machines.
Source code in geno_lewm/config/loader.py
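The reason sorted keys matter for hashing can be shown with a small standard-library sketch. To stay dependency-free it uses JSON with sort_keys=True as a stand-in for the YAML dump, and SHA-256 as the digest; both choices are assumptions for illustration, as are the function names.

```python
import hashlib
import json
from typing import Any


def canonical_bytes(payload: dict[str, Any]) -> bytes:
    """Serialize with sorted keys so insertion order cannot change the bytes."""
    text = json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def digest(payload: dict[str, Any]) -> str:
    return hashlib.sha256(canonical_bytes(payload)).hexdigest()


# Same content, different key insertion order.
first = {"seed": 0, "encoder": {"dtype": "bf16", "state_layer": 20}}
second = {"encoder": {"state_layer": 20, "dtype": "bf16"}, "seed": 0}

same_hash = digest(first) == digest(second)
```

Without key sorting, two machines that build the same config in a different order could write different bytes and therefore record different hashes in the manifest.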
iter_top_level_field_names
¶
Yield the canonical top-level keys accepted by the loader.
The loader rejects any payload key not in this set via
:class:geno_lewm.errors.UnknownTopLevelKeyError. Tests use this
helper to assert the schema and the AC list stay in sync.
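A sync check of the kind mentioned above might look like the following sketch. RootSketch, iter_names_sketch, and the documented list are hypothetical; the real helper and the real key list live in the package and its tests.

```python
from collections.abc import Iterator
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RootSketch:
    # Hypothetical stand-in with three of the top-level fields.
    run_id: str = "default"
    seed: int = 0
    deterministic: bool = False


def iter_names_sketch(schema: type) -> Iterator[str]:
    """Yield field names in declaration order."""
    for f in fields(schema):
        yield f.name


# Hypothetical documented key list that a test would compare against.
documented = ["run_id", "seed", "deterministic"]
in_sync = set(iter_names_sketch(RootSketch)) == set(documented)
```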