08 — Performance budget¶

Status: Authoritative for v0.1
Companion RFCs: RFC-0007, RFC-0010, RFC-0016

Performance is part of the public release contract. Numbers here are budgets and targets unless explicitly reported from a measured release artifact. The v0.1 release includes a public efficiency_report.json for the first checkpoint, dataset, and demo artifact set; the broader Apple-Silicon, quantized-runtime, and rollout-speed budgets still need v0.2 measurement before docs can describe them as achieved.

Reference machines¶

Class	Spec	Where used
Server	1× NVIDIA H100 80 GB SXM5, 256 GB RAM, NVMe SSD	training, batched inference
Workstation	1× NVIDIA RTX 4090 24 GB, 64 GB RAM, NVMe SSD	researcher local
Laptop	Apple M3 Max, 64 GB unified memory, NVMe SSD	on-device target
CPU-only laptop	8-core x86-64, 32 GB RAM, NVMe SSD	accessibility fallback

Per-PR checks may run deterministic microbenchmarks where suitable hardware is available. Release-candidate benchmarks must run on a dedicated reference machine per class and record explicit limitations when a class is unavailable.

Inference latency (warm cache)¶

Metric	H100	RTX 4090	M3 Max	CPU-only
Single-variant scoring (warm)	< 5 ms	< 20 ms	< 200 ms	< 1.5 s
Single-variant scoring (cold; Carbon call)	< 50 ms	< 100 ms	< 800 ms	< 6 s
Predictor-only forward pass (bf16)	< 1 ms	< 3 ms	< 25 ms	< 200 ms
Action-encoder forward pass	< 0.2 ms	< 0.5 ms	< 5 ms	< 30 ms

Cold path includes one Carbon-500M call on the edited window; the warm path assumes both windows are in the LRU.

Inference throughput (batched)¶

Metric (batch=256)	H100	RTX 4090	M3 Max
Variants per second	> 5,000	> 1,000	> 100
Tokens per second (Carbon-500M, bf16)	> 1.0 M	> 250 k	> 30 k

Batched throughput is measured after warm-up (10 batches discarded); reported value is the median across 100 batches.

Memory footprint¶

Metric	H100	RTX 4090	M3 Max
Predictor only (loaded)	< 200 MB	< 200 MB	< 200 MB
Predictor + Carbon-500M (bf16)	< 3 GB	< 3 GB	< 8 GB
Predictor + Carbon-500M (int4)	n/a	n/a	< 1 GB
Process RSS at idle (no model)	< 200 MB	< 200 MB	< 300 MB
Process RSS during VCF scoring (b=64)	< 4 GB	< 4 GB	< 10 GB

Peak memory is measured via psutil (RSS) and nvidia-smi (GPU). Apple Silicon uses unified memory; reported value is the larger of CPU RSS and Metal peak.

Release candidates record the actual first-paper single-variant latency, batched throughput, and peak-memory run as efficiency_report.json, generated with python -m bench.inference --release-efficiency ... --output-json .... That artifact must include the benchmark command, hardware/runtime notes, input identities, sample count, warm-up count, and limitations; prose in the README or paper draft must not substitute for the machine-readable report.

Disk footprint¶

Item	Size
GenoLeWM artifact at bf16 (predictor + action encoder + calibration)	< 120 MB
GenoLeWM artifact at int8	< 40 MB
Carbon-500M at bf16 (Hub download)	~1 GB
Reference embedding cache (default config, full human genome)	< 5 GB
Sum of model + cache + tokenizer + receipts (typical user)	< 5 GB

Whole-VCF scoring¶

Operation	H100	RTX 4090	M3 Max
100k-variant VCF scoring	< 1 min	< 5 min	< 30 min
Whole-genome reference encoding	< 30 min	< 2 h	< 8 h

These targets assume the reference embedding cache is empty at the start (cold path). With a warm cache, throughput approaches the per-variant predictor throughput.

Training compute budget¶

Phase	Tokens (target embeddings)	Wall clock on H100
Phase 1 baseline	~10 B	~24 h
Phase 1 full	~50 B	~5 d
Phase 2 (LoRA + full edit suite)	~100 B	~10 d

The Phase 1 baseline must complete in under 24 hours on a single H100; an implementation that breaks this budget is not shippable. The first-experiment training config is a CUDA job config, and training_preflight_report.json rejects the release run unless runtime.device: cuda resolves to an available accelerator with at least 40 GiB of device memory by default. A100/H100-class runners are the supported target for this release pass; smaller GPUs may be useful for debugging only when the preflight threshold is explicitly lowered and the result is not treated as first-paper evidence.

Planning latency¶

CEM with n_iterations=5, n_samples=1024, K=5 (default config):

Metric	H100	M3 Max
Planning call (end-to-end)	< 1 s	< 30 s
Predictor calls per planning call	25,600	25,600

Planning with larger n_samples or K scales linearly in predictor calls.

Quantization quality budget¶

A quantized variant is shippable only when:

Quantization	Max AUROC drop on ClinVar coding	Max rollout cosine drop
`int8` (predictor only)	1.0	0.02
`int4` (Carbon Q4_K_M + predictor int8)	2.0	0.05

Drops are measured against the bf16 reference checkpoint on the same data.

Profiling discipline¶

CPU profiling: py-spy (sampled) for CLI commands; cProfile for small reproducers.
GPU profiling: torch.profiler for training; NVIDIA Nsight Systems for ad-hoc deep dives.
Memory profiling: tracemalloc for Python; nvidia-smi/Metal System Trace for device memory.
Microbenchmarks: pytest-benchmark for hot functions; results persisted in bench/results/.

After the first measured release baseline exists, every performance regression > 5% on any benchmark in this section is treated as a defect. CI should not auto-fail on noisy shared-runner perf changes, but release automation or a scheduled benchmark job must open a tracking issue when a confirmed regression is detected.

Performance regression process¶

Nightly benchmark detects a > 5% regression.
Tracking issue opened with the suspect range of commits.
Bisect to the culprit commit.
Either revert, fix, or formally accept the regression with an RFC amendment + CHANGELOG entry.

Invariants¶

ID	Invariant	Enforced by
INV-PERF-1	The Phase-1 baseline run completes within 24 hours on a single H100	release-gate check
INV-PERF-2	Predictor + Carbon-500M run within 8 GB unified memory on M3 Max at int4	release-gate check
INV-PERF-3	Single-variant warm-cache scoring on M3 Max ≤ 200 ms	`bench.inference --release-efficiency`; `tools.release.efficiency_report`
INV-PERF-4	No benchmark in this document regresses > 5% silently between releases	nightly job
INV-PERF-5	Memory targets are measured peak-RSS, not steady-state	benchmark harness

Open questions¶

ID	Question	Owner	Target
OQ-PERF-1	Whether to add iOS / iPad performance targets in v0.2	core	v0.2
OQ-PERF-2	Whether to ship a `geno-lewm-bench` command exposing the benchmark harness publicly	core	v0.2
OQ-PERF-3	What public latency target to require for the real terminal demo	core	first demo release