RFC-0015: Testing strategy and CI gates
- Status: Draft
- Author(s): GenoLeWM Project
- Created: 2026-05-20
- Updated: 2026-06-02
- Depends on: RFC-0007, RFC-0012, RFC-0013
- Supersedes: —
- Implementation status: Partial. Unit, property, integration,
lint, API snapshot, docs, release-contract, fixture-tier ML, event
registry, and scope-language gates exist. Dedicated hosted
`tests/ml` and eval-smoke CI gates remain open.
1. Summary
This RFC specifies the five-layer test pyramid (unit, property, ML, integration, eval), the per-PR and per-release CI gates, the fixture discipline, and the reproducibility requirements. The eval suite itself is specified in RFC-0007; this RFC defines the engineering harness around it.
2. Motivation
ML projects have a recurring failure mode: lots of model-quality numbers, zero engineering tests. GenoLeWM commits to both. The testing strategy is the structure that makes "no release without all gates green" a mechanical claim rather than a slogan.
3. Specification
The full contract is in docs/spec/07-testing-strategy.md.
Load-bearing decisions:
3.1 Layers
- Unit (`tests/unit/`): isolated behavior, ≥ 90% line coverage on touched modules per PR.
- Property (`tests/property/`): every `INV-*` invariant in the spec corpus.
- ML (`tests/ml/`): identity-at-init, loss-decreases-on-fixture, no-NaN/Inf, collapse heuristics, receipt determinism.
- Integration (`tests/integration/`): train→eval smoke, score VCF, export round-trip, verifier accept/reject.
- Eval (`tests/eval/`): smoke eval (1k variants) and full eval (RFC-0007).
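Two of the ML-layer checks, loss-decreases-on-fixture and no-NaN/Inf, can be illustrated without a deep-learning framework. The following is a minimal sketch rather than the project's actual tests: the one-parameter toy model, the helper names `train_steps`, `all_finite`, and `loss_decreases`, and the 0.5 ratio are all hypothetical stand-ins for a real fixture run under `tests/ml/`.

```python
import math


def train_steps(xs, ys, lr=0.05, steps=20):
    """Fit y = w * x by gradient descent on mean squared error; return per-step losses."""
    w = 0.0  # hypothetical toy model: a single scalar weight
    losses = []
    n = len(xs)
    for _ in range(steps):
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        losses.append(loss)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return losses


def all_finite(values):
    """No-NaN/Inf check: every value is a finite float."""
    return all(math.isfinite(v) for v in values)


def loss_decreases(losses, min_ratio=0.5):
    """Loss-decreases check: final loss is at most min_ratio of the initial loss."""
    return losses[-1] <= min_ratio * losses[0]


def test_loss_decreases_on_fixture():
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # tiny fixture generated by w = 2
    losses = train_steps(xs, ys)
    assert all_finite(losses)
    assert loss_decreases(losses)
```

A real test in this layer would presumably swap the toy loop for a few optimizer steps on a committed fixture, keeping the same two assertions.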
3.2 Per-PR CI gates
In CI order:
- `ruff check .`
- `ruff format --check .`
- `mypy --strict geno_lewm/`
- Custom AST checks (no `print`, network imports confined, error discipline).
- `pytest tests/unit`
- `pytest tests/property --hypothesis-seed=<commit-hash>`
- `pytest tests/ml`
- `pytest tests/integration -k 'not slow'`
- Coverage gate (changed files ≥ 90%).
- Smoke eval (1k ClinVar coding + 500-window rollout) if PR touches relevant paths.
- License headers present on every `geno_lewm/*.py`.
3.3 Per-release gates
All per-PR gates plus:
- Full eval suite (RFC-0007).
- Performance benchmarks (`docs/spec/08-performance-budget.md`).
- Reproducibility: build twice, compare artifact hashes.
- Receipt re-verification across machines on supported backends.
- Privacy property test: 10k random payloads, zero leaks.
- Manual checklist (clinical banner, SECURITY contacts, CHANGELOG).
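The "build twice, compare artifact hashes" gate amounts to hashing every file in two build outputs and reporting any mismatch. Here is a minimal sketch, assuming each build writes its artifacts to its own directory. The helper names `hash_tree` and `diff_builds` are hypothetical, and SHA-256 is an assumption, since this RFC does not name a hash function.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_builds(first, second):
    """Return sorted relative paths that are missing from one build or differ in content."""
    a, b = hash_tree(first), hash_tree(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from `diff_builds` would mean the two builds are byte-identical; any entry names an artifact worth inspecting for embedded timestamps or unordered content.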
3.4 Custom AST checks
Implemented in `tools/lint/`. Each is a tree walk that fails CI on
violation:
| Check | Rule |
|---|---|
| `no_print` | `print(...)` not allowed in `geno_lewm/` |
| `network_confined` | `urllib`, `httpx`, `requests`, `aiohttp` imports only in `geno_lewm/deploy/runtime.py` and `geno_lewm/cli/update.py` |
| `raise_geno_lewm_error` | every `raise X(...)` has `X` a subclass of `GenoLeWMError` |
| `registered_error_code` | every `GenoLeWMError(code=...)` uses a literal in `ERROR_CODES` |
| `registered_event_name` | every `logger.info(event=...)` uses a literal in `EVENTS` |
| `registered_metric_name` | every `counter.inc("name")` / `histogram.observe("name")` uses a literal in `METRICS` |
The checks live alongside the code they enforce; they have unit tests of their own.
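To make "tree walk" concrete, the first two rules could look roughly like the sketch below, built on the standard-library `ast` module. This is an illustration, not the code in `tools/lint/`: the function names, the `file:line` return format, and the choice to match imports on the top-level package name are all assumptions.

```python
import ast

# Modules named by the network_confined rule, and the two files allowed to import them.
NETWORK_MODULES = {"urllib", "httpx", "requests", "aiohttp"}
NETWORK_ALLOWED = {"geno_lewm/deploy/runtime.py", "geno_lewm/cli/update.py"}


def check_no_print(source, filename):
    """Return 'file:line' strings for every direct print(...) call in source."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            violations.append(f"{filename}:{node.lineno}")
    return violations


def check_network_confined(source, filename):
    """Return 'file:line' strings for network imports outside the allowed files."""
    if filename in NETWORK_ALLOWED:
        return []
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for `from . import x`
        else:
            continue
        # Compare on the top-level package so `urllib.request` matches `urllib`.
        if any(name.split(".")[0] in NETWORK_MODULES for name in names):
            violations.append(f"{filename}:{node.lineno}")
    return violations
```

A CI wrapper would run each check over every file under `geno_lewm/` and exit non-zero if any list is non-empty. Note that this sketch only catches direct `print(...)` calls; an aliased reference such as `p = print; p("x")` would slip through.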
3.5 Fixture discipline
Fixtures live under tests/fixtures/. None contain real user data.
Small fixtures (< 1 MB) are committed; larger ones are generated from
seeded random by tests/conftest.py. Each fixture's docstring documents
seed and provenance.
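A generated fixture that follows this discipline might look like the sketch below. The function name `make_variant_fixture`, the record fields, and the seed value are hypothetical. The points being illustrated are the private seeded RNG and the docstring that records seed and provenance.

```python
import random


def make_variant_fixture(seed=1234, n=5):
    """Synthetic single-nucleotide variants for tests.

    Seed: 1234 (default). Provenance: generated from random.Random(seed);
    contains no real user data.
    """
    rng = random.Random(seed)  # private RNG so global random state is untouched
    bases = "ACGT"
    variants = []
    for _ in range(n):
        ref = rng.choice(bases)
        alt = rng.choice([b for b in bases if b != ref])  # alt always differs from ref
        variants.append({"pos": rng.randrange(1, 1_000_000), "ref": ref, "alt": alt})
    return variants
```

In `tests/conftest.py` a generator like this could be wrapped in a `pytest.fixture` so tests receive the data by name. Because the RNG is local and seeded, two calls with the same seed return identical records.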
3.6 Reproducibility
- `PYTEST_RANDOM_SEED` env (default: commit hash mod 2^32).
- ML tests pin `torch.manual_seed` and `numpy.random.seed`.
- Deterministic backends required for verifier tests.
- Tests do not write outside `tmp_path` provided by pytest.
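The seed rule in the first bullet can be sketched as follows. The helper names `resolve_seed` and `seed_everything` are hypothetical, and reducing an explicit override mod 2^32 is an assumption made here so the value always fits the range NumPy's legacy seeding accepts. NumPy and PyTorch are imported lazily so the sketch also runs where they are not installed.

```python
import os
import random


def resolve_seed(commit_hash, env=None):
    """Return PYTEST_RANDOM_SEED if set, else the hex commit hash reduced mod 2^32."""
    env = os.environ if env is None else env
    override = env.get("PYTEST_RANDOM_SEED")
    if override is not None:
        return int(override) % 2**32  # assumption: overrides are reduced as well
    return int(commit_hash, 16) % 2**32


def seed_everything(seed):
    """Seed the stdlib RNG; seed NumPy and PyTorch only if they are importable."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass
```

For example, `resolve_seed("abc1234", env={})` returns `0xabc1234`, since that value is already below 2^32; a full 40-character hash is reduced to its low 32 bits.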
3.7 Coverage tool
pytest-cov with branch coverage enabled. The coverage gate is
changed-files coverage, not whole-project coverage, to avoid the
ratchet pathology of new code lowering global numbers.
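A changed-files gate can be computed from a coverage report plus the list of files a PR touches. The sketch below assumes a dictionary shaped like coverage.py's JSON report, where each entry under `files` has a `summary` with `covered_lines` and `num_statements`. The function names are hypothetical, and for brevity the sketch counts lines only and ignores branch data.

```python
def changed_files_coverage(report, changed_files):
    """Aggregate line coverage (percent) over the changed files only.

    Changed files that do not appear in the report (docs, configs) are ignored.
    """
    covered = total = 0
    for path in changed_files:
        entry = report["files"].get(path)
        if entry is None:
            continue
        covered += entry["summary"]["covered_lines"]
        total += entry["summary"]["num_statements"]
    # With nothing measurable changed, treat the gate as trivially satisfied.
    return 100.0 if total == 0 else 100.0 * covered / total


def gate_passes(report, changed_files, threshold=90.0):
    """True when changed-files coverage meets the threshold."""
    return changed_files_coverage(report, changed_files) >= threshold
```

With this shape, a PR that touches only a well-tested file passes even when an untouched sibling sits at 10% coverage, which is the behavior the changed-files choice is meant to produce.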
3.8 Nightly / cron jobs
- Larger smoke eval (5k variants).
- Memory regression check.
- Cross-platform smoke (macOS, Linux, Windows on CPU-only).
- Dependency audit (`pip-audit`).
Failures open a tracking issue automatically; they do not block PRs.
4. Rationale and alternatives
4.1 Why a separate ML test layer rather than mixing into unit / integration?
ML-specific failures (collapse, NaN drift) have a different shape from either unit or integration failures. Keeping the layer separate lets us budget runtime explicitly and gate on the right kinds of regression.
4.2 Why changed-files coverage instead of project-wide?
A project-wide coverage gate punishes new contributions that add correctly-tested code into a file whose siblings happen to have low coverage. Changed-files coverage is fair and equally tight.
4.3 Why hypothesis-seed from the commit hash?
Reproducibility. A property test failure on commit abc1234 must be
reproducible on commit abc1234 regardless of when the run happened.
Pinning the seed to the commit hash gives that for free.
4.4 Why AST checks rather than runtime asserts?
AST checks fail at PR review time. Runtime asserts fail in production. Same correctness contract, drastically lower cost when violated.
4.5 Why is the smoke eval gated on touched paths only?
A docs-only PR should not run a 5-minute eval. A predictor-touching PR should. Path-touched gating is the simplest right answer.
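Path-based gating can be as small as a glob match over the list of changed files. A minimal sketch follows; the patterns in `SMOKE_EVAL_PATTERNS` are hypothetical, since this RFC does not enumerate the relevant paths, and `needs_smoke_eval` is an illustrative name.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the RFC does not list which paths count as relevant.
SMOKE_EVAL_PATTERNS = ("geno_lewm/*", "tests/eval/*")


def needs_smoke_eval(changed_files, patterns=SMOKE_EVAL_PATTERNS):
    """True if any changed file matches a pattern that should trigger the smoke eval."""
    return any(
        fnmatchcase(path, pattern) for path in changed_files for pattern in patterns
    )
```

Here `needs_smoke_eval(["docs/index.md"])` is false, while adding `geno_lewm/model/predictor.py` to the list makes it true, because `*` in `fnmatch` patterns also matches path separators.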
5. Unresolved questions
- Whether to add `tests/benchmark/` with `pytest-benchmark` historical comparisons. Tracked as OQ-TEST-1.
- Whether the smoke eval should include a non-coding ClinVar subset by default. Tracked as OQ-TEST-2.
- When to enable `python -X dev` in CI. Tracked as OQ-TEST-3.
- Whether to add mutation testing (`mutmut`) as a gate; currently reserved for post-v1.
6. Future work
- An ML-eval cache that memoizes evals by `model_id` so repeated CI runs don't re-eval unchanged checkpoints.
- A `geno-lewm-bench` CLI exposing the benchmark harness publicly (related to OQ-PERF-2).
- Cross-version checkpoint compatibility tests run nightly against the prior MAJOR.
7. Changelog
- 2026-06-02 — Updated implementation status for current local/CI test gates and remaining hosted ML/eval smoke gaps.
- 2026-05-20 — Initial draft.