08 — Performance budget¶
Performance is part of the public release contract. Numbers here are
budgets and targets unless explicitly reported from a measured release
artifact. The v0.1 release includes a public efficiency_report.json for
the first checkpoint, dataset, and demo artifact set; the broader
Apple-Silicon, quantized-runtime, and rollout-speed budgets still need
v0.2 measurement before docs can describe them as achieved.
Reference machines¶
| Class | Spec | Where used |
|---|---|---|
| Server | 1× NVIDIA H100 80 GB SXM5, 256 GB RAM, NVMe SSD | training, batched inference |
| Workstation | 1× NVIDIA RTX 4090 24 GB, 64 GB RAM, NVMe SSD | researcher local |
| Laptop | Apple M3 Max, 64 GB unified memory, NVMe SSD | on-device target |
| CPU-only laptop | 8-core x86-64, 32 GB RAM, NVMe SSD | accessibility fallback |
Per-PR checks may run deterministic microbenchmarks where suitable hardware is available. Release-candidate benchmarks must run on a dedicated reference machine per class and record explicit limitations when a class is unavailable.
Inference latency (warm cache)¶
| Metric | H100 | RTX 4090 | M3 Max | CPU-only |
|---|---|---|---|---|
| Single-variant scoring (warm) | < 5 ms | < 20 ms | < 200 ms | < 1.5 s |
| Single-variant scoring (cold; Carbon call) | < 50 ms | < 100 ms | < 800 ms | < 6 s |
| Predictor-only forward pass (bf16) | < 1 ms | < 3 ms | < 25 ms | < 200 ms |
| Action-encoder forward pass | < 0.2 ms | < 0.5 ms | < 5 ms | < 30 ms |
Cold path includes one Carbon-500M call on the edited window; the warm path assumes both windows are in the LRU.
Inference throughput (batched)¶
| Metric (batch=256) | H100 | RTX 4090 | M3 Max |
|---|---|---|---|
| Variants per second | > 5,000 | > 1,000 | > 100 |
| Tokens per second (Carbon-500M, bf16) | > 1.0 M | > 250 k | > 30 k |
Batched throughput is measured after warm-up (10 batches discarded); reported value is the median across 100 batches.
Memory footprint¶
| Metric | H100 | RTX 4090 | M3 Max |
|---|---|---|---|
| Predictor only (loaded) | < 200 MB | < 200 MB | < 200 MB |
| Predictor + Carbon-500M (bf16) | < 3 GB | < 3 GB | < 8 GB |
| Predictor + Carbon-500M (int4) | n/a | n/a | < 1 GB |
| Process RSS at idle (no model) | < 200 MB | < 200 MB | < 300 MB |
| Process RSS during VCF scoring (b=64) | < 4 GB | < 4 GB | < 10 GB |
Peak memory is measured via psutil (RSS) and nvidia-smi (GPU).
Apple Silicon uses unified memory; reported value is the larger of CPU
RSS and Metal peak.
Release candidates record the actual first-paper single-variant latency,
batched throughput, and peak-memory run as
efficiency_report.json, generated with
python -m bench.inference --release-efficiency ... --output-json ....
That artifact must include the benchmark command, hardware/runtime notes,
input identities, sample count, warm-up count, and limitations; prose in
the README or paper draft must not substitute for the machine-readable
report.
Disk footprint¶
| Item | Size |
|---|---|
| GenoLeWM artifact at bf16 (predictor + action encoder + calibration) | < 120 MB |
| GenoLeWM artifact at int8 | < 40 MB |
| Carbon-500M at bf16 (Hub download) | ~1 GB |
| Reference embedding cache (default config, full human genome) | < 5 GB |
| Sum of model + cache + tokenizer + receipts (typical user) | < 5 GB |
Whole-VCF scoring¶
| Operation | H100 | RTX 4090 | M3 Max |
|---|---|---|---|
| 100k-variant VCF scoring | < 1 min | < 5 min | < 30 min |
| Whole-genome reference encoding | < 30 min | < 2 h | < 8 h |
These targets assume the reference embedding cache is empty at the start (cold path). With a warm cache, throughput approaches the per-variant predictor throughput.
Training compute budget¶
| Phase | Tokens (target embeddings) | Wall clock on H100 |
|---|---|---|
| Phase 1 baseline | ~10 B | ~24 h |
| Phase 1 full | ~50 B | ~5 d |
| Phase 2 (LoRA + full edit suite) | ~100 B | ~10 d |
The Phase 1 baseline must complete in under 24 hours on a single H100; an
implementation that breaks this budget is not shippable.
The first-experiment training config is a CUDA job config, and
training_preflight_report.json rejects the release run unless
runtime.device: cuda resolves to an available accelerator with at least
40 GiB of device memory by default. A100/H100-class runners are the
supported target for this release pass; smaller GPUs may be useful for
debugging only when the preflight threshold is explicitly lowered and the
result is not treated as first-paper evidence.
Planning latency¶
CEM with n_iterations=5, n_samples=1024, K=5 (default config):
| Metric | H100 | M3 Max |
|---|---|---|
| Planning call (end-to-end) | < 1 s | < 30 s |
| Predictor calls per planning call | 25,600 | 25,600 |
Planning with larger n_samples or K scales linearly in predictor
calls.
Quantization quality budget¶
A quantized variant is shippable only when:
| Quantization | Max AUROC drop on ClinVar coding | Max rollout cosine drop |
|---|---|---|
int8 (predictor only) |
1.0 | 0.02 |
int4 (Carbon Q4_K_M + predictor int8) |
2.0 | 0.05 |
Drops are measured against the bf16 reference checkpoint on the same data.
Profiling discipline¶
- CPU profiling:
py-spy(sampled) for CLI commands;cProfilefor small reproducers. - GPU profiling:
torch.profilerfor training; NVIDIA Nsight Systems for ad-hoc deep dives. - Memory profiling:
tracemallocfor Python;nvidia-smi/Metal System Trace for device memory. - Microbenchmarks:
pytest-benchmarkfor hot functions; results persisted inbench/results/.
After the first measured release baseline exists, every performance regression > 5% on any benchmark in this section is treated as a defect. CI should not auto-fail on noisy shared-runner perf changes, but release automation or a scheduled benchmark job must open a tracking issue when a confirmed regression is detected.
Performance regression process¶
- Nightly benchmark detects a > 5% regression.
- Tracking issue opened with the suspect range of commits.
- Bisect to the culprit commit.
- Either revert, fix, or formally accept the regression with an RFC amendment + CHANGELOG entry.
Invariants¶
| ID | Invariant | Enforced by |
|---|---|---|
| INV-PERF-1 | The Phase-1 baseline run completes within 24 hours on a single H100 | release-gate check |
| INV-PERF-2 | Predictor + Carbon-500M run within 8 GB unified memory on M3 Max at int4 | release-gate check |
| INV-PERF-3 | Single-variant warm-cache scoring on M3 Max ≤ 200 ms | bench.inference --release-efficiency; tools.release.efficiency_report |
| INV-PERF-4 | No benchmark in this document regresses > 5% silently between releases | nightly job |
| INV-PERF-5 | Memory targets are measured peak-RSS, not steady-state | benchmark harness |
Open questions¶
| ID | Question | Owner | Target |
|---|---|---|---|
| OQ-PERF-1 | Whether to add iOS / iPad performance targets in v0.2 | core | v0.2 |
| OQ-PERF-2 | Whether to ship a geno-lewm-bench command exposing the benchmark harness publicly |
core | v0.2 |
| OQ-PERF-3 | What public latency target to require for the real terminal demo | core | first demo release |