04 — Error model

  • Status: Authoritative for v0.1
  • Companion RFC: RFC-0012

The error model is the contract for how subsystems fail. All of GenoLeWM raises typed exceptions from a single hierarchy rooted at geno_lewm.errors.GenoLeWMError. No subsystem swallows exceptions, no public function returns None on failure, and no subsystem stringly-types its errors (that is, signals the kind of failure only through message text).

Hierarchy

Exception
└── GenoLeWMError                    # root; all GenoLeWM exceptions inherit
    ├── ConfigError                  # bad configuration or missing field
    │   ├── SchemaCompatError        # on-disk schema MAJOR mismatch
    │   └── MissingConfigError       # required config field absent
    ├── InputError                   # caller-supplied input invalid
    │   ├── InvalidEditError         # EditSpec invariants violated
    │   ├── UnsupportedEditError     # edit type / length not in v1 scope
    │   ├── WindowMismatchError      # window ref bases ≠ EditSpec.ref
    │   ├── OverlappingEditsError    # haplotype edits overlap
    │   ├── OutOfWindowError         # rel_pos outside window
    │   └── VcfParseError            # malformed VCF / FASTA
    ├── ResourceError                # capacity, IO, network
    │   ├── CacheCorruptError        # cache shard fails integrity check
    │   ├── DiskFullError            # storage exhausted during write
    │   ├── OutOfMemoryError         # explicit reraise of CUDA OOM with context
    │   ├── ModelNotFoundError       # checkpoint missing / not downloadable
    │   ├── RuntimeSetupError        # first-run network step failed
    │   └── NetworkCallProhibitedError  # runtime fail-closed (RFC-0010 §3.7)
    ├── TrainingError                # training-loop-specific failures
    │   ├── CollapseDetectedError    # collapse alert criteria tripped
    │   ├── NaNLossError             # loss became NaN / Inf
    │   └── DataLoaderError          # data pipeline failure
    ├── EvalError                    # evaluation harness failures
    │   ├── EvalDatasetError         # benchmark data load failed
    │   └── EvalRegressionError      # smoke-eval gate failed
    ├── DeployError                  # export / runtime failures
    │   ├── ExportFormatError        # ONNX / Core ML / GGUF conversion failed
    │   ├── QuantizationError        # int8/int4 calibration failed
    │   └── BackendUnsupportedError  # backend not available on host
    ├── ProvenanceError              # receipt/provenance failures
    │   ├── ManifestHashMismatchError    # manifest content != stated model_id
    │   ├── InputCommitmentMismatchError # recomputed commitment != receipt
    │   ├── OutputCommitmentMismatchError # bit-mismatch on re-run
    │   ├── ProvenanceKindUnsupportedError  # verifier doesn't know kind
    │   └── ReceiptSchemaError       # receipt JSON invalid
    └── InternalError                # bugs we caught; should never surface
        ├── InvariantViolation       # an `INV-*` invariant was breached
        └── UnreachableError         # control flow reached "unreachable" branch

Every leaf class is concrete and named. Adding a new leaf is a MINOR change; removing or renaming one is a MAJOR change.

Error payloads

Every GenoLeWMError exposes a structured payload:

class GenoLeWMError(Exception):
    code: str                       # stable error code, e.g., "INPUT.INVALID_EDIT"
    message: str                    # human-readable
    details: dict[str, object]      # JSON-serializable structured fields
    remediation: str | None         # actionable hint when one exists

    def to_dict(self) -> dict[str, object]: ...
    def to_json(self) -> str: ...
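
The interface above leaves the constructor and the method bodies open. A minimal sketch of one way to fill them in follows. The keyword-only constructor, the stand-in class bodies, and the example message are assumptions made for illustration, not the actual geno_lewm.errors implementation.

```python
import json


class GenoLeWMError(Exception):
    """Stand-in for geno_lewm.errors.GenoLeWMError (constructor is assumed)."""

    code: str  # set by each concrete leaf class

    def __init__(self, message, *, details=None, remediation=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})  # must stay JSON-serializable
        self.remediation = remediation

    def to_dict(self):
        return {
            "code": self.code,
            "message": self.message,
            "details": self.details,
            "remediation": self.remediation,
        }

    def to_json(self):
        # sort_keys gives a stable key order, which helps when diffing logs.
        return json.dumps(self.to_dict(), sort_keys=True)


class InputError(GenoLeWMError):
    """Family class; only leaves carry a code in this sketch."""


class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


# The message stays generic; caller data travels in `details`.
err = InvalidEditError(
    "alt allele contains characters outside the allowed alphabet",
    details={"rel_pos": 17, "alt_length": 3},
    remediation="Check the alt allele against the documented alphabet.",
)
payload = err.to_dict()
```

Keeping user data out of the message and inside details matches INV-ERR-5 and lets a redaction filter operate on structured fields only.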

Error codes are dotted-uppercase, prefixed by the top-level category, and documented in the registry. Examples:

  • INPUT.INVALID_EDIT
  • INPUT.UNSUPPORTED_EDIT
  • INPUT.WINDOW_MISMATCH
  • RESOURCE.CACHE_CORRUPT
  • RESOURCE.NETWORK_PROHIBITED
  • TRAINING.COLLAPSE_DETECTED
  • DEPLOY.QUANTIZATION_FAILED
  • PROVENANCE.MANIFEST_HASH_MISMATCH

Codes are part of the public surface. Renaming a code is a MAJOR change.

Error code registry

The registry lives at geno_lewm/errors.py::ERROR_CODES and is the source of truth. A linter rule enforces that every code raised at runtime is registered. The registry is regenerated into docs/api/error-codes.md on each release.
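
As a rough illustration of what such a lint check might verify, the sketch below validates one code against a registry. The registry shape (code mapped to a one-line description), the check_code helper, and the CONFIG, EVAL, and INTERNAL prefixes are assumptions inferred from the hierarchy; the real ERROR_CODES structure may differ.

```python
import re

# Assumed shape: code -> one-line description. Only a few entries are shown.
ERROR_CODES = {
    "INPUT.INVALID_EDIT": "EditSpec invariants violated",
    "INPUT.UNSUPPORTED_EDIT": "edit type / length not in v1 scope",
    "RESOURCE.CACHE_CORRUPT": "cache shard fails integrity check",
    "PROVENANCE.MANIFEST_HASH_MISMATCH": "manifest content != stated model_id",
}

# Prefixes inferred from the top-level families (assumption for CONFIG,
# EVAL, and INTERNAL, which do not appear in the documented examples).
CATEGORIES = frozenset(
    {"CONFIG", "INPUT", "RESOURCE", "TRAINING", "EVAL", "DEPLOY",
     "PROVENANCE", "INTERNAL"}
)

# CATEGORY.NAME, where NAME is uppercase letters, digits, and underscores.
CODE_PATTERN = re.compile(r"[A-Z]+\.[A-Z][A-Z0-9_]*")


def check_code(code, registry=ERROR_CODES):
    """Return a list of problems with `code`; an empty list means it passes."""
    problems = []
    if CODE_PATTERN.fullmatch(code) is None:
        problems.append("not dotted-uppercase")
    elif code.split(".", 1)[0] not in CATEGORIES:
        problems.append("unknown category prefix")
    if code not in registry:
        problems.append("not registered")
    return problems
```

A linter could collect the code attribute of every GenoLeWMError subclass and fail the build if check_code returns anything for any of them.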

Raise vs return discipline

| Situation | Mechanism |
| --- | --- |
| Caller-supplied data fails a documented invariant | raise typed InputError subclass |
| Expected absence (cache miss, optional field) | return None or sentinel; document in API |
| Resource exhaustion (memory, disk, network) | raise typed ResourceError subclass |
| Internal invariant violation | raise InvariantViolation; log at ERROR; never silent |
| Receipt/provenance check discovers a mismatch | raise typed ProvenanceError subclass |
| Training instability (NaN, collapse) | raise typed TrainingError subclass; trainer can opt to catch |
| CLI top-level | catch GenoLeWMError, exit non-zero, print code + message |

The CLI catches at exactly one place: geno_lewm/cli/_dispatch.py. Library callers see the full traceback unless they explicitly catch.
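
The first two cases of the discipline can be contrasted in a short sketch. The function names cached_embedding and check_ref_bases and the stand-in class bodies are hypothetical; only the class names and the INPUT.WINDOW_MISMATCH code come from this document.

```python
from typing import Optional


class GenoLeWMError(Exception):
    code = "UNSET"  # placeholder; concrete leaves override it

    def __init__(self, message, *, details=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})


class InputError(GenoLeWMError):
    pass


class WindowMismatchError(InputError):
    code = "INPUT.WINDOW_MISMATCH"


def cached_embedding(cache, key) -> Optional[list]:
    """Expected absence: a cache miss is not a failure, so return None."""
    return cache.get(key)


def check_ref_bases(window, rel_pos, ref):
    """Invalid caller input: raise a typed InputError subclass."""
    observed = window[rel_pos:rel_pos + len(ref)]
    if observed != ref:
        raise WindowMismatchError(
            "window reference bases differ from EditSpec.ref",
            details={"rel_pos": rel_pos, "expected": ref, "observed": observed},
        )
```

The Optional return type makes the expected-absence case visible to a type checker, while the failure case can only be observed as an exception.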

Exit codes

CLI exit codes:

| Code | Meaning |
| --- | --- |
| 0 | success |
| 1 | uncategorized failure (a bug in CLI dispatch; should be rare) |
| 2 | InputError family |
| 3 | ConfigError family |
| 4 | ResourceError family |
| 5 | TrainingError family |
| 6 | EvalError family |
| 7 | DeployError family |
| 8 | ProvenanceError family |
| 9 | InternalError family (please file a bug) |
| 130 | SIGINT |

Tooling that wraps the CLI relies on these codes; changing any of them is a MAJOR change.
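
The mapping from exception family to exit status could be implemented as a single lookup at the dispatch boundary. The sketch below is illustrative only: dispatch, exit_code_for, and the stand-in classes are hypothetical and do not reproduce geno_lewm/cli/_dispatch.py.

```python
import sys


class GenoLeWMError(Exception):
    code = "UNSET"  # placeholder; concrete leaves override it


class InputError(GenoLeWMError): pass
class ConfigError(GenoLeWMError): pass
class ResourceError(GenoLeWMError): pass
class TrainingError(GenoLeWMError): pass
class EvalError(GenoLeWMError): pass
class DeployError(GenoLeWMError): pass
class ProvenanceError(GenoLeWMError): pass
class InternalError(GenoLeWMError): pass


class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


class CacheCorruptError(ResourceError):
    code = "RESOURCE.CACHE_CORRUPT"


# Family -> exit status, following the exit-code list above.
FAMILY_EXIT_CODES = (
    (InputError, 2), (ConfigError, 3), (ResourceError, 4), (TrainingError, 5),
    (EvalError, 6), (DeployError, 7), (ProvenanceError, 8), (InternalError, 9),
)


def exit_code_for(exc):
    """Map an exception to its family's exit status; 1 if no family matches."""
    for family, status in FAMILY_EXIT_CODES:
        if isinstance(exc, family):
            return status
    return 1


def dispatch(command):
    """Run a zero-argument command and translate failures into an exit status."""
    try:
        command()
    except GenoLeWMError as exc:
        print(f"{exc.code}: {exc}", file=sys.stderr)  # code + message
        return exit_code_for(exc)
    except KeyboardInterrupt:
        return 130  # SIGINT
    return 0
```

An entry point would then call sys.exit(dispatch(main)). Exceptions outside the hierarchy are deliberately not caught here, so they surface with a traceback.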

Failure modes by subsystem

The table is exhaustive for v0.1. Adding a new mode requires an RFC update and a code registration.

| Subsystem | Failure mode | Detection | Raised as | Recovery |
| --- | --- | --- | --- | --- |
| EditSpec | malformed bases | constructor invariant | InvalidEditError | caller fixes input |
| EditSpec | length > V1_MAX_LEN | length check | UnsupportedEditError | defer to v2 / SV adapter |
| Apply | ref-bases mismatch | string compare | WindowMismatchError | re-fetch window or fix EditSpec |
| Haplotype | overlapping edits | interval check | OverlappingEditsError | caller decomposes |
| Encoder | model not on Hub | HTTP 404 | ModelNotFoundError | check --model-id, network |
| Encoder | OOM on long window | torch OOM | OutOfMemoryError | use smaller window |
| Cache | shard truncated | length / CRC | CacheCorruptError | geno-lewm-cache-windows --repair |
| Predictor | NaN/Inf in loss | per-step check | NaNLossError | restart from last checkpoint |
| Trainer | collapse criteria tripped | monitoring hooks | CollapseDetectedError | trainer stops; alert in wandb |
| Trainer | corrupt batch | dataloader exception | DataLoaderError | log + skip; abort if rate > 1% |
| Eval | benchmark file absent | filesystem | EvalDatasetError | geno-lewm-prepare-* |
| Eval | smoke-eval delta > threshold | comparison | EvalRegressionError | block PR |
| Export | unsupported op for target | converter | ExportFormatError | use alternative target |
| Export | int8 calibration failed | activation stats | QuantizationError | re-run with more calibration data |
| Runtime | wrong backend at load | capability probe | BackendUnsupportedError | reload with backend="auto" |
| Runtime | post-setup network call attempted | URL hook | NetworkCallProhibitedError | bug; file a report |
| Receipt | malformed JSON | schema check | ReceiptSchemaError | verifier rejects |
| Verifier | manifest hash mismatch | hash compare | ManifestHashMismatchError | verifier rejects |
| Verifier | input commitment mismatch | hash compare | InputCommitmentMismatchError | verifier rejects |
| Verifier | output commitment mismatch | hash compare | OutputCommitmentMismatchError | verifier rejects |
| Internal | invariant breach | runtime assert | InvariantViolation | bug |
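
The "log + skip; abort if rate > 1%" recovery for corrupt batches can be sketched as a generator wrapper. This document does not say how the rate is measured, so the sketch assumes a running rate over batches seen so far, plus a hypothetical min_seen warm-up so that one early bad batch does not abort the run. The names iter_good_batches and decode are illustrative.

```python
import logging

log = logging.getLogger("geno_lewm.sketch")


class GenoLeWMError(Exception):
    pass


class TrainingError(GenoLeWMError):
    pass


class DataLoaderError(TrainingError):
    pass


def iter_good_batches(raw_batches, decode, max_bad_rate=0.01, min_seen=100):
    """Yield decoded batches, skipping corrupt ones until the rate is too high."""
    seen = bad = 0
    for raw in raw_batches:
        seen += 1
        try:
            batch = decode(raw)
        except DataLoaderError:
            bad += 1
            if seen >= min_seen and bad / seen > max_bad_rate:
                raise  # too many corrupt batches: abort the run
            log.warning("skipping corrupt batch %d (%d bad so far)", seen, bad)
            continue
        # Yield outside the try block so errors raised by the consumer
        # are never mistaken for data pipeline failures.
        yield batch
```

Re-raising the original DataLoaderError keeps the traceback of the batch that tipped the rate over the threshold.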

Logging error events

Every raised GenoLeWMError emits a structured log event:

{"event": "error", "code": <code>, "message": <message>, "details": {...},
 "remediation": <remediation>, "ts": <iso-8601>}

See 05-observability.md for the log format. Error events are tagged with severity=error. InternalError events additionally include a stack trace and the file/line of the raise.
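
A sketch of how such an event might be assembled from an exception follows. The error_event helper, the stack, file, and line field names, representing severity as a plain field, and the INTERNAL.INVARIANT_VIOLATION code are all assumptions; 05-observability.md remains the reference for the actual format.

```python
import json
import traceback
from datetime import datetime, timezone


class GenoLeWMError(Exception):
    code = "UNSET"  # placeholder; concrete leaves override it

    def __init__(self, message, *, details=None, remediation=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})
        self.remediation = remediation


class InternalError(GenoLeWMError):
    pass


class InvariantViolation(InternalError):
    code = "INTERNAL.INVARIANT_VIOLATION"  # hypothetical code


def error_event(exc, now=None):
    """Build the structured log event for a raised GenoLeWMError."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    event = {
        "event": "error",
        "severity": "error",
        "code": exc.code,
        "message": exc.message,
        "details": exc.details,
        "remediation": exc.remediation,
        "ts": ts,
    }
    if isinstance(exc, InternalError):
        # Internal errors also carry the stack trace and the raise site.
        event["stack"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        frames = traceback.extract_tb(exc.__traceback__)
        if frames:  # empty if the exception was never raised
            event["file"] = frames[-1].filename
            event["line"] = frames[-1].lineno
    return event
```

The event is a plain dict, so json.dumps(error_event(exc)) produces one log line as long as details stays JSON-serializable.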

Translation to receipts

The receipt format (RFC-0011 §3.3) does not include error details; a failed inference does not produce a receipt. Partial-failure semantics (e.g., per-variant errors during VCF scoring) are recorded in an adjacent .errors.jsonl file with one error record per failed variant.
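
The per-variant error file could be produced along these lines. The record fields (variant_id, code, message, details), the choice to treat only InputError as a per-variant failure, and the score_variants and toy_score names are assumptions for illustration; the document specifies only that there is one record per failed variant.

```python
import json
import tempfile
from pathlib import Path


class GenoLeWMError(Exception):
    code = "UNSET"  # placeholder; concrete leaves override it

    def __init__(self, message, *, details=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})


class InputError(GenoLeWMError):
    pass


class UnsupportedEditError(InputError):
    code = "INPUT.UNSUPPORTED_EDIT"


def score_variants(variants, score_one, errors_path):
    """Score each (id, variant) pair; write one JSON line per failed variant."""
    scores = {}
    with open(errors_path, "w", encoding="utf-8") as sink:
        for variant_id, variant in variants:
            try:
                scores[variant_id] = score_one(variant)
            except InputError as exc:  # other families abort the whole run
                record = {
                    "variant_id": variant_id,
                    "code": exc.code,
                    "message": exc.message,
                    "details": exc.details,
                }
                sink.write(json.dumps(record, sort_keys=True) + "\n")
    return scores


def toy_score(variant):
    """Toy scorer: rejects long edits, otherwise returns the length."""
    if len(variant) > 4:
        raise UnsupportedEditError(
            "edit length exceeds the supported maximum",
            details={"length": len(variant)},
        )
    return float(len(variant))


with tempfile.TemporaryDirectory() as tmp:
    errors_path = Path(tmp) / "scores.errors.jsonl"
    scores = score_variants(
        [("v1", "AC"), ("v2", "ACGTACGT")], toy_score, errors_path
    )
    text = errors_path.read_text(encoding="utf-8")
    records = [json.loads(line) for line in text.splitlines()]
```

Only the variants that scored successfully appear in scores; the failed one is recorded in the adjacent file.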

Invariants

| ID | Invariant | Enforced by |
| --- | --- | --- |
| INV-ERR-1 | Every raised exception in geno_lewm/ is a GenoLeWMError subclass | linter rule |
| INV-ERR-2 | Every raise statement uses a code registered in ERROR_CODES | linter rule |
| INV-ERR-3 | No public function returns None to signal failure | type checker |
| INV-ERR-4 | Test suite asserts both the exception class and the code for every documented failure mode | tests/unit/test_errors.py |
| INV-ERR-5 | Error messages do not embed user data; structured details carry it | redaction filter in observability.py |
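
INV-ERR-4 asks tests to pin down both the exception class and the code. A minimal sketch of such a test using the standard unittest module follows. The make_edit function and its validation rule are toy stand-ins, and the real tests/unit/test_errors.py may use a different framework.

```python
import unittest


class GenoLeWMError(Exception):
    code = "UNSET"  # placeholder; concrete leaves override it


class InputError(GenoLeWMError):
    pass


class InvalidEditError(InputError):
    code = "INPUT.INVALID_EDIT"


def make_edit(ref, alt):
    """Toy constructor (hypothetical): rejects bases outside a small alphabet."""
    if not alt or set(alt) - set("ACGTN"):
        raise InvalidEditError("alt allele contains unsupported characters")
    return (ref, alt)


class TestInvalidEdit(unittest.TestCase):
    def test_malformed_bases(self):
        with self.assertRaises(InvalidEditError) as caught:
            make_edit("A", "Z")
        # Assert the exact leaf class, not merely a subclass match...
        self.assertIs(type(caught.exception), InvalidEditError)
        # ...and the stable code that tooling depends on.
        self.assertEqual(caught.exception.code, "INPUT.INVALID_EDIT")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInvalidEdit)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Checking the code as well as the class catches the case where a leaf is renamed or re-parented correctly but its code silently changes.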

Open questions

| ID | Question | Owner | Target |
| --- | --- | --- | --- |
| OQ-ERR-1 | Whether to add helpers for inspecting exception cause chains (__cause__) | core | v0.2 |
| OQ-ERR-2 | Whether to expose typed Result[T, E] returns at API boundaries (Rust-style) | core | v0.3 |
| OQ-ERR-3 | Whether to integrate with sentry-sdk or similar; acceptable only with redaction by default, which is non-negotiable | core | post-v1 |