04 - Error model
- Status: Authoritative for v0.1
- Companion RFC: RFC-0012
The error model is the contract for how subsystems fail. Every part of
GenoLeWM raises typed exceptions from a single hierarchy rooted at
geno_lewm.errors.GenoLeWMError. No subsystem swallows exceptions, no
public function returns None on failure, and no subsystem stringly-types
its errors (that is, encodes failures as bare strings instead of typed
exceptions).
Hierarchy

    Exception
    └── GenoLeWMError                           # root; all GenoLeWM exceptions inherit
        ├── ConfigError                         # bad configuration or missing field
        │   ├── SchemaCompatError               # on-disk schema MAJOR mismatch
        │   └── MissingConfigError              # required config field absent
        │
        ├── InputError                          # caller-supplied input invalid
        │   ├── InvalidEditError                # EditSpec invariants violated
        │   ├── UnsupportedEditError            # edit type / length not in v1 scope
        │   ├── WindowMismatchError             # window ref bases ≠ EditSpec.ref
        │   ├── OverlappingEditsError           # haplotype edits overlap
        │   ├── OutOfWindowError                # rel_pos outside window
        │   └── VcfParseError                   # malformed VCF / FASTA
        │
        ├── ResourceError                       # capacity, IO, network
        │   ├── CacheCorruptError               # cache shard fails integrity check
        │   ├── DiskFullError                   # storage exhausted during write
        │   ├── OutOfMemoryError                # explicit reraise of CUDA OOM with context
        │   ├── ModelNotFoundError              # checkpoint missing / not downloadable
        │   ├── RuntimeSetupError               # first-run network step failed
        │   └── NetworkCallProhibitedError      # runtime fail-closed (RFC-0010 §3.7)
        │
        ├── TrainingError                       # training-loop-specific failures
        │   ├── CollapseDetectedError           # collapse alert criteria tripped
        │   ├── NaNLossError                    # loss became NaN / Inf
        │   └── DataLoaderError                 # data pipeline failure
        │
        ├── EvalError                           # evaluation harness failures
        │   ├── EvalDatasetError                # benchmark data load failed
        │   └── EvalRegressionError             # smoke-eval gate failed
        │
        ├── DeployError                         # export / runtime failures
        │   ├── ExportFormatError               # ONNX / Core ML / GGUF conversion failed
        │   ├── QuantizationError               # int8/int4 calibration failed
        │   └── BackendUnsupportedError         # backend not available on host
        │
        ├── ProvenanceError                     # receipt/provenance failures
        │   ├── ManifestHashMismatchError       # manifest content != stated model_id
        │   ├── InputCommitmentMismatchError    # recomputed commitment != receipt
        │   ├── OutputCommitmentMismatchError   # bit-mismatch on re-run
        │   ├── ProvenanceKindUnsupportedError  # verifier doesn't know kind
        │   └── ReceiptSchemaError              # receipt JSON invalid
        │
        └── InternalError                       # bugs we caught; should never surface
            ├── InvariantViolation              # an `INV-*` invariant was breached
            └── UnreachableError                # control flow reached "unreachable" branch

Every leaf class is concrete and named. Adding a new leaf is a MINOR change; removing or renaming one is a MAJOR change.
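Because every leaf inherits from a category class, callers can catch at whatever granularity suits them. The following is a minimal sketch, not the project's code: the class definitions are stand-ins for a small slice of the tree, and the `describe` helper is hypothetical.

```python
# Stand-in definitions for a slice of the hierarchy; illustration only.
class GenoLeWMError(Exception): ...
class InputError(GenoLeWMError): ...
class InvalidEditError(InputError): ...
class OutOfWindowError(InputError): ...
class ResourceError(GenoLeWMError): ...
class CacheCorruptError(ResourceError): ...


def describe(exc: Exception) -> str:
    """Hypothetical helper: report the most specific handler that matches."""
    try:
        raise exc
    except InvalidEditError:
        return "leaf: invalid edit"
    except InputError:
        return "family: input"
    except GenoLeWMError:
        return "root: some other GenoLeWM failure"
    except Exception:
        return "foreign exception"


print(describe(InvalidEditError("bad bases")))
print(describe(OutOfWindowError("outside window")))
print(describe(CacheCorruptError("shard damaged")))
print(describe(ValueError("not ours")))
```

Handlers are tried top to bottom, so the most specific class has to come first; a leading `except GenoLeWMError` would shadow the family and leaf handlers.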
Error payloads
Every GenoLeWMError exposes a structured payload:

    class GenoLeWMError(Exception):
        code: str                    # stable error code, e.g., "INPUT.INVALID_EDIT"
        message: str                 # human-readable
        details: dict[str, object]   # JSON-serializable structured fields
        remediation: str | None      # actionable hint when one exists

        def to_dict(self) -> dict[str, object]: ...
        def to_json(self) -> str: ...

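One way the payload contract could be satisfied is sketched here. The constructor signature, the choice to set `code` as a class attribute on each leaf, and the `details` keys in the usage example are all assumptions made for illustration; only the four field names and the two method names come from the interface above.

```python
import json
from typing import Optional


class GenoLeWMError(Exception):
    """Sketch of the payload contract; not the project's actual implementation."""

    code: str  # assumption: each concrete subclass sets its registered code

    def __init__(
        self,
        message: str,
        *,
        details: Optional[dict] = None,
        remediation: Optional[str] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})
        self.remediation = remediation

    def to_dict(self) -> dict:
        return {
            "code": self.code,
            "message": self.message,
            "details": self.details,
            "remediation": self.remediation,
        }

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(self.to_dict(), sort_keys=True)


class InputError(GenoLeWMError):
    """Category class; the leaves below it carry the codes."""


class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


err = InvalidEditError(
    "EditSpec invariants violated",
    details={"field": "alt", "reason": "non-ACGT base"},  # hypothetical keys
    remediation="Use only A, C, G, T in the alt allele.",
)
print(err.to_json())
```

Keeping the message generic and putting the specifics in `details` matches the rule that messages do not embed user data.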
Error codes are dotted-uppercase, prefixed by the top-level category, and documented in the registry. Examples:

- INPUT.INVALID_EDIT
- INPUT.UNSUPPORTED_EDIT
- INPUT.WINDOW_MISMATCH
- RESOURCE.CACHE_CORRUPT
- RESOURCE.NETWORK_PROHIBITED
- TRAINING.COLLAPSE_DETECTED
- DEPLOY.QUANTIZATION_FAILED
- PROVENANCE.MANIFEST_HASH_MISMATCH

Codes are part of the public surface. Renaming a code is a MAJOR change.
Error code registry
The registry lives at geno_lewm/errors.py::ERROR_CODES and is the source
of truth. A linter rule enforces that every code raised at runtime is
registered. On each release, docs/api/error-codes.md is regenerated from
the registry.
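The linter rule itself is not specified here. As a rough runtime analogue, the check can be sketched as a walk over the class tree that reports any leaf whose code does not map back to it. The registry shape (a dict from code to exception class) and both helper functions are assumptions made for this sketch.

```python
class GenoLeWMError(Exception):
    code: str


class InputError(GenoLeWMError):
    """Category class; not raised directly in this sketch."""


class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


class UnsupportedEditError(InputError):
    code = "INPUT.UNSUPPORTED_EDIT"


class WindowMismatchError(InputError):
    code = "INPUT.WINDOW_MISMATCH"


# Assumed registry shape: code -> exception class. WindowMismatchError is
# deliberately left out so the check has something to report.
ERROR_CODES = {
    "INPUT.INVALID_EDIT": InvalidEditError,
    "INPUT.UNSUPPORTED_EDIT": UnsupportedEditError,
}


def leaf_classes(root):
    """Return every class under root that has no subclasses of its own."""
    subclasses = root.__subclasses__()
    if not subclasses:
        return [root]
    return [leaf for sub in subclasses for leaf in leaf_classes(sub)]


def unregistered(root, registry):
    """Names of leaf classes whose code does not map back to them."""
    return sorted(
        cls.__name__
        for cls in leaf_classes(root)
        if registry.get(getattr(cls, "code", None)) is not cls
    )


print(unregistered(GenoLeWMError, ERROR_CODES))  # ['WindowMismatchError']
ERROR_CODES["INPUT.WINDOW_MISMATCH"] = WindowMismatchError
print(unregistered(GenoLeWMError, ERROR_CODES))  # []
```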
Raise vs return discipline
| Situation | Mechanism |
|---|---|
| Caller-supplied data fails a documented invariant | raise typed InputError subclass |
| Expected absence (cache miss, optional field) | return None or sentinel; document in API |
| Resource exhaustion (memory, disk, network) | raise typed ResourceError subclass |
| Internal invariant violation | raise InvariantViolation; log at ERROR; never silent |
| Receipt/provenance check discovers a mismatch | raise typed ProvenanceError subclass |
| Training instability (NaN, collapse) | raise typed TrainingError; trainer can opt to catch |
| CLI top-level | catch GenoLeWMError, exit non-zero, print code + message |
The CLI catches at exactly one place: geno_lewm/cli/_dispatch.py. Library
callers see the full traceback unless they explicitly catch.
Exit codes
CLI exit codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | uncategorized failure (a bug in CLI dispatch; should be rare) |
| 2 | InputError family |
| 3 | ConfigError family |
| 4 | ResourceError family |
| 5 | TrainingError family |
| 6 | EvalError family |
| 7 | DeployError family |
| 8 | ProvenanceError family |
| 9 | InternalError family (please file a bug) |
| 130 | SIGINT |
Tooling that wraps the CLI relies on these codes; changing any of them is a MAJOR change.
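The family-to-exit-code mapping can be sketched as a single catch site. This is an illustration rather than the contents of geno_lewm/cli/_dispatch.py: `EXIT_CODES`, `exit_code_for`, and `dispatch` are hypothetical names, and the classes are stand-ins.

```python
import sys
from typing import Callable


class GenoLeWMError(Exception): ...
class InputError(GenoLeWMError): ...
class ConfigError(GenoLeWMError): ...
class ResourceError(GenoLeWMError): ...
class TrainingError(GenoLeWMError): ...
class EvalError(GenoLeWMError): ...
class DeployError(GenoLeWMError): ...
class ProvenanceError(GenoLeWMError): ...
class InternalError(GenoLeWMError): ...
class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


# Family -> exit code, copied from the exit-code table.
EXIT_CODES = {
    InputError: 2,
    ConfigError: 3,
    ResourceError: 4,
    TrainingError: 5,
    EvalError: 6,
    DeployError: 7,
    ProvenanceError: 8,
    InternalError: 9,
}


def exit_code_for(exc: GenoLeWMError) -> int:
    for family, exit_code in EXIT_CODES.items():
        if isinstance(exc, family):
            return exit_code
    return 1  # not in any known family: uncategorized


def dispatch(command: Callable[[], None]) -> int:
    """Run a command and translate its outcome into a process exit code."""
    try:
        command()
    except GenoLeWMError as exc:
        label = getattr(exc, "code", type(exc).__name__)
        print(f"{label}: {exc}", file=sys.stderr)
        return exit_code_for(exc)
    except KeyboardInterrupt:
        return 130
    return 0


def bad_input() -> None:
    raise InvalidEditError("EditSpec invariants violated")


print(dispatch(bad_input))  # 2
```

Exceptions outside the hierarchy are left to propagate in this sketch, so library-style tracebacks stay intact for genuine bugs.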
Failure modes by subsystem
The table is exhaustive for v0.1. Adding a new mode requires an RFC
update and a code registration.
| Subsystem | Failure mode | Detection | Raised as | Recovery |
|---|---|---|---|---|
| EditSpec | malformed bases | constructor invariant | InvalidEditError | caller fixes input |
| EditSpec | length > V1_MAX_LEN | length check | UnsupportedEditError | defer to v2 / SV adapter |
| Apply | ref-bases mismatch | string compare | WindowMismatchError | re-fetch window or fix EditSpec |
| Haplotype | overlapping edits | interval check | OverlappingEditsError | caller decomposes |
| Encoder | model not on Hub | HTTP 404 | ModelNotFoundError | check --model-id, network |
| Encoder | OOM on long window | torch OOM | OutOfMemoryError | use smaller window |
| Cache | shard truncated | length / CRC | CacheCorruptError | geno-lewm-cache-windows --repair |
| Predictor | NaN/Inf in loss | per-step check | NaNLossError | restart from last checkpoint |
| Trainer | collapse criteria tripped | monitoring hooks | CollapseDetectedError | trainer stops; alert in wandb |
| Trainer | corrupt batch | dataloader exception | DataLoaderError | log + skip; abort if rate > 1% |
| Eval | benchmark file absent | filesystem | EvalDatasetError | geno-lewm-prepare-* |
| Eval | smoke-eval delta > threshold | comparison | EvalRegressionError | block PR |
| Export | unsupported op for target | converter | ExportFormatError | use alternative target |
| Export | int8 calibration failed | activation stats | QuantizationError | re-run with more calibration data |
| Runtime | wrong backend at load | capability probe | BackendUnsupportedError | reload with backend="auto" |
| Runtime | post-setup network call attempted | URL hook | NetworkCallProhibitedError | bug; file a report |
| Receipt | malformed JSON | schema check | ReceiptSchemaError | verifier rejects |
| Verifier | manifest hash mismatch | hash compare | ManifestHashMismatchError | verifier rejects |
| Verifier | input commitment mismatch | hash compare | InputCommitmentMismatchError | verifier rejects |
| Verifier | output commitment mismatch | hash compare | OutputCommitmentMismatchError | verifier rejects |
| Internal | invariant breach | runtime assert | InvariantViolation | bug |
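As an illustration of one row (Encoder, OOM on long window), the detection, typed raise, and recovery hint might be wired together as follows. This is a sketch under stated assumptions: Python's built-in MemoryError stands in for the framework's OOM exception, and `encode_window`, `fake_encoder`, and the `details` key are hypothetical.

```python
class GenoLeWMError(Exception):
    def __init__(self, message, *, details=None, remediation=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})
        self.remediation = remediation


class ResourceError(GenoLeWMError): ...
class OutOfMemoryError(ResourceError): ...


def encode_window(window_len, encode_fn):
    """Run encode_fn, translating a low-level OOM into the typed error."""
    try:
        return encode_fn(window_len)
    except MemoryError as exc:  # stand-in for the framework's CUDA OOM exception
        raise OutOfMemoryError(
            "encoder ran out of memory",
            details={"window_len": window_len},
            remediation="use a smaller window",
        ) from exc  # keep the original exception reachable via __cause__


def fake_encoder(window_len):
    """Hypothetical encoder that fails on long windows."""
    if window_len > 1000:
        raise MemoryError("simulated OOM")
    return [0.0] * 4


try:
    encode_window(4096, fake_encoder)
except OutOfMemoryError as err:
    print(err.details, "| remediation:", err.remediation)
    print("caused by:", repr(err.__cause__))
```

Using `raise ... from exc` preserves the low-level exception as `__cause__`, so the added context does not hide the original traceback.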
Logging error events
Every raised GenoLeWMError emits a structured log event:

    {"event": "error", "code": <code>, "message": <message>, "details": {...},
     "remediation": <remediation>, "ts": <iso-8601>}

See 05-observability.md for the log format. Error
events are tagged with severity=error. InternalError events
additionally include a stack trace and the file/line of the raise.
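Building such an event from an exception could look roughly like this. The sketch assumes the payload attributes described under Error payloads, uses the standard logging module at ERROR level to carry the severity, and introduces two hypothetical helpers, `error_event` and `log_error`; the real log format is the one defined in 05-observability.md.

```python
import json
import logging
from datetime import datetime, timezone


class GenoLeWMError(Exception):
    def __init__(self, message, *, details=None, remediation=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})
        self.remediation = remediation


class ResourceError(GenoLeWMError): ...
class CacheCorruptError(ResourceError):
    code = "RESOURCE.CACHE_CORRUPT"


def error_event(exc):
    """Build the structured event for one exception."""
    return {
        "event": "error",
        "code": exc.code,
        "message": exc.message,
        "details": exc.details,
        "remediation": exc.remediation,
        "ts": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
    }


def log_error(logger, exc):
    """Emit the event as one JSON line at ERROR level and return it."""
    event = error_event(exc)
    logger.error(json.dumps(event))
    return event


logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
err = CacheCorruptError(
    "cache shard fails integrity check",
    details={"shard": 12},  # hypothetical detail key
    remediation="geno-lewm-cache-windows --repair",
)
log_error(logging.getLogger("geno_lewm"), err)
```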
Translation to receipts
The receipt format (RFC-0011 §3.3) does not include error details; a
failed inference does not produce a receipt. Partial-failure semantics
(e.g., per-variant errors during VCF scoring) are recorded in an adjacent
.errors.jsonl file with one error record per failed variant.
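A per-variant sidecar of this kind might be produced as sketched here. The record fields, the `score` and `score_all` functions, the demo value of `V1_MAX_LEN`, and the decision to treat only InputError subclasses as per-variant failures are all assumptions; the source specifies only that the file holds one error record per failed variant.

```python
import json
import tempfile
from pathlib import Path


class GenoLeWMError(Exception): ...
class InputError(GenoLeWMError): ...
class UnsupportedEditError(InputError):
    code = "INPUT.UNSUPPORTED_EDIT"


V1_MAX_LEN = 4  # placeholder for the demo; the real limit is not stated here


def score(alt: str) -> float:
    """Hypothetical scorer: rejects alleles longer than V1_MAX_LEN."""
    if len(alt) > V1_MAX_LEN:
        raise UnsupportedEditError("edit length not in v1 scope")
    return float(len(alt))


def score_all(variants, errors_path):
    """Score (variant_id, alt) pairs; failed variants go to the sidecar file."""
    scores = {}
    with open(errors_path, "w", encoding="utf-8") as sink:
        for variant_id, alt in variants:
            try:
                scores[variant_id] = score(alt)
            except InputError as exc:
                record = {
                    "variant_id": variant_id,
                    "code": exc.code,
                    "message": str(exc),
                }
                sink.write(json.dumps(record) + "\n")  # one record per line
    return scores


out = Path(tempfile.mkdtemp()) / "scores.errors.jsonl"
scores = score_all([("v1", "A"), ("v2", "ACGTACGT"), ("v3", "AC")], out)
print(scores)
print(out.read_text(encoding="utf-8"), end="")
```

Errors outside the caught family still abort the whole run in this sketch, which keeps resource and internal failures from being silently downgraded to per-variant records.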
Invariants
| ID | Invariant | Enforced by |
|---|---|---|
| INV-ERR-1 | Every raised exception in geno_lewm/ is a GenoLeWMError subclass | linter rule |
| INV-ERR-2 | Every raise statement uses a code registered in ERROR_CODES | linter rule |
| INV-ERR-3 | No public function returns None to signal failure | type checker |
| INV-ERR-4 | Test suite asserts both the exception class and the code for every documented failure mode | tests/unit/test_errors.py |
| INV-ERR-5 | Error messages do not embed user data; structured details carry it | redaction filter in observability.py |
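INV-ERR-4 asks tests to pin down both the exception class and the code. A table-driven check in that spirit is sketched here using plain assertions; it is not the content of tests/unit/test_errors.py, and `make_edit`, `assert_fails_with`, `CASES`, and the demo value of `V1_MAX_LEN` are hypothetical.

```python
class GenoLeWMError(Exception): ...
class InputError(GenoLeWMError): ...
class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"
class UnsupportedEditError(InputError):
    code = "INPUT.UNSUPPORTED_EDIT"


V1_MAX_LEN = 4  # placeholder for the demo; the real limit is not stated here


def make_edit(alt: str) -> str:
    """Hypothetical constructor enforcing two documented failure modes."""
    if set(alt) - set("ACGT"):
        raise InvalidEditError("malformed bases")
    if len(alt) > V1_MAX_LEN:
        raise UnsupportedEditError("edit length not in v1 scope")
    return alt


def assert_fails_with(trigger, expected_cls, expected_code):
    """Assert that trigger() raises exactly expected_cls with expected_code."""
    try:
        trigger()
    except GenoLeWMError as exc:
        # Exact type match: a sibling or parent class must not pass.
        assert type(exc) is expected_cls, f"got {type(exc).__name__}"
        assert exc.code == expected_code, f"got {exc.code}"
    else:
        raise AssertionError("no exception raised")


CASES = [
    (lambda: make_edit("ACXT"), InvalidEditError, "INPUT.INVALID_EDIT"),
    (lambda: make_edit("ACGTACGT"), UnsupportedEditError, "INPUT.UNSUPPORTED_EDIT"),
]

for trigger, cls, code in CASES:
    assert_fails_with(trigger, cls, code)
print("checked", len(CASES), "failure modes")
```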
Open questions
| ID | Question | Owner | Target |
|---|---|---|---|
| OQ-ERR-1 | Whether to add cause-chain inspection helpers for reading __cause__ | core | v0.2 |
| OQ-ERR-2 | Whether to expose typed Result[T, E] returns at API boundaries (Rust-style) | core | v0.3 |
| OQ-ERR-3 | Whether to integrate with sentry-sdk or similar; acceptable only if redaction is enforced by default | core | post-v1 |