06 - Security
GenoLeWM operates over the most sensitive consumer data category there is: personal genome data. The security model is therefore conservative, local-first, and built around explicit trust boundaries. Threats are enumerated, not implied. Defenses are mechanical, not policy.
Current public release status: the v0.1 model package and terminal demo are published and record runtime-preflight, network-guard, redaction, and checksum receipt evidence for the released demo path. That evidence does not establish a general privacy assurance, clinical safety claim, or runtime assurance mode beyond checksum provenance.
Threat model
Assets
| Asset | Why it matters | Where it lives |
|---|---|---|
| User VCF / variant data | Permanent, identifying, family-implicating | User filesystem only |
| Reference embeddings cache | Public-domain (reference genome derivatives) | User filesystem |
| Trained model weights | Public; signed by maintainers | Hugging Face Hub + user filesystem |
| Receipts | Checksum provenance | User filesystem; shareable |
| Manifest | Trust anchor for model identity | Inside checkpoint |
| Calibration table | Affects score interpretation | Inside checkpoint |
| Surprise scores | User-derived analysis; user-owned | User filesystem only |
Attackers
| Attacker | Capability | Goal |
|---|---|---|
| Network attacker | passive eavesdropping or active MitM during first-run downloads | exfiltrate variant data; substitute model weights |
| Local malware | arbitrary process on the user's machine | exfiltrate variant data; tamper with model |
| Supply-chain attacker | publishes a backdoored release upstream | poisoned weights produce manipulated scores |
| Curious recipient of a shared result | wants to learn the input that produced a published score | reverse-engineer the input from the receipt |
| Misconfigured operator | misuses the runtime in a cloud context | accidentally leaks variants |
Out of attacker model (explicit)
- Physical-access attackers with control of the hardware.
- Side-channel attackers on shared hardware (mitigated only when running on dedicated user hardware, which is the deployment context).
- Cryptographic break of SHA-256.
Trust boundaries
┌──────────────────────────────────────────────────────────────┐
│ User device                                                  │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ GenoLeWM runtime (untrusted to the user beyond manifest  │ │
│ │ verification; trusted to itself once verified)           │ │
│ │                                                          │ │
│ │ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │ │
│ │ │ Carbon       │  │ Predictor    │  │ Provenance   │     │ │
│ │ │ (frozen, by  │  │ (small       │  │ (manifest +  │     │ │
│ │ │ public       │  │ trainable    │  │ receipts)    │     │ │
│ │ │ commitment)  │  │ head)        │  │              │     │ │
│ │ └──────────────┘  └──────────────┘  └──────────────┘     │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ User VCF ──► runtime (in-process only) ──► scores + receipt  │
└──────────────────────────────────────────────────────────────┘
        │
        │  explicit user-initiated network calls only:
        │    - first-run weight download (Hugging Face Hub, pinned hash)
        │    - explicit `geno-lewm-update`
        ▼
Hugging Face Hub (untrusted; verified by content hash)
The released runtime must fail closed on any non-allowlisted network call. The allowlist is a single configuration file shipped with the build, and the loader verifies its hash on startup.
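The startup check on the allowlist could look roughly like the following. This is a minimal sketch, not the project's loader: the function name `load_allowlist`, the error type `AllowlistTamperedError`, and the JSON shape `{"hosts": [...]}` are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


class AllowlistTamperedError(RuntimeError):
    """Hypothetical error: the allowlist file does not match its pinned hash."""


def load_allowlist(path: Path, expected_sha256: str) -> frozenset:
    """Return the allowlisted hosts, or raise if the file hash does not match."""
    raw = path.read_bytes()
    actual = hashlib.sha256(raw).hexdigest()
    if actual != expected_sha256:
        # Fail closed: an unverifiable allowlist means no network access at all.
        raise AllowlistTamperedError(f"allowlist hash mismatch: {actual}")
    # Assumed file shape: {"hosts": ["huggingface.co", ...]}
    return frozenset(json.loads(raw)["hosts"])
```

In this sketch the expected digest would be compiled into the build, so editing the shipped configuration file alone is not enough to widen the allowlist.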
Defenses
1. Local-first release requirement
- The released runtime must make no network call after first-run setup, other than an explicit, user-initiated `geno-lewm-update`.
- A guard at the HTTP-client construction site raises `NetworkCallProhibitedError` if any call is attempted outside the documented allowlist.
- The desktop app release target has no cloud sync, no accounts, no telemetry. See RFC-0010 §3.7.
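A guard of this kind might be sketched as below. Only the error name `NetworkCallProhibitedError` comes from this document; its base class, the function `guard_url`, and the `user_initiated` flag are hypothetical and stand in for whatever the real construction site checks.

```python
from urllib.parse import urlsplit


class NetworkCallProhibitedError(RuntimeError):
    """Raised when a network call is attempted outside the allowlist."""


def guard_url(url: str, allowlist: frozenset, user_initiated: bool) -> str:
    """Return the URL unchanged if the call is permitted; otherwise raise."""
    if not user_initiated:
        # Only explicit, user-initiated calls are ever allowed (assumed flag).
        raise NetworkCallProhibitedError("network call was not user-initiated")
    host = urlsplit(url).hostname
    if host is None or host not in allowlist:
        # Fail closed on unparseable URLs as well as unknown hosts.
        raise NetworkCallProhibitedError(f"host not allowlisted: {host!r}")
    return url
```

Calling the guard before any HTTP client object exists means a prohibited call fails before a socket is opened, which is the point of placing it at the construction site.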
2. Content-addressed weights
- Every weight file is identified by SHA-256 of its `safetensors` bytes.
- The manifest binds weights, calibration, tokenizer, and configuration into one canonical-JSON document whose hash is the `model_id`.
- On load, the runtime recomputes every hash and refuses to start if any fails.
- Pinning to `model_id` is the published-results contract.
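The load-time check can be illustrated with a short sketch. It assumes the manifest exposes a mapping from relative file name to expected hex digest; the names `sha256_file`, `verify_weights`, and `WeightHashMismatchError` are hypothetical and not taken from the codebase.

```python
import hashlib
from pathlib import Path


class WeightHashMismatchError(RuntimeError):
    """Hypothetical error raised when a weight file fails verification."""


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(root: Path, expected: dict) -> None:
    """Refuse to proceed unless every listed file matches its expected hash."""
    for rel_name, want in expected.items():
        got = sha256_file(root / rel_name)
        if got != want:
            raise WeightHashMismatchError(f"{rel_name}: expected {want}, got {got}")
```

Raising on the first mismatch, rather than collecting warnings, matches the "refuses to start" behavior described above.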
3. Input commitments and receipts
- Single-variant scoring can produce a receipt binding `(model_id, input_commitment, output, runtime metadata)` (`03-data-model.md`).
- VCF scoring can produce a JSONL sidecar with one v1 receipt per scored alternate. Receipts contain commitments and runtime metadata, not raw DNA sequence or VCF fields.
- Receipts are checksum-only in the active schema. `batch_receipt_report.json` is a release artifact that verifies the score JSONL and per-row receipt JSONL streams as one batch. It does not collapse a multi-row VCF output into one single-output v1 receipt.
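To make the binding concrete, here is a minimal sketch of a checksum-only receipt. The field names, the use of a plain SHA-256 of the input window as the commitment, and the helpers `make_receipt` and `verify_receipt` are assumptions for illustration; the actual v1 schema is defined in `03-data-model.md`.

```python
import hashlib
import json


def _digest(obj) -> str:
    """SHA-256 over a sorted-keys, no-whitespace JSON encoding."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def make_receipt(model_id: str, window: str, output: float, runtime: dict) -> dict:
    """Bind model, input commitment, output, and runtime metadata together."""
    body = {
        "model_id": model_id,
        # Only a hash of the input is stored, never the sequence itself.
        "input_commitment": hashlib.sha256(window.encode("ascii")).hexdigest(),
        "output": output,
        "runtime": runtime,
    }
    return {**body, "checksum": _digest(body)}


def verify_receipt(receipt: dict) -> bool:
    """Recompute the checksum over every field except the checksum itself."""
    body = {k: v for k, v in receipt.items() if k != "checksum"}
    return receipt.get("checksum") == _digest(body)
```

A receipt built this way detects accidental or casual modification of any bound field. It carries no signature, which is consistent with the checksum-only scope stated above.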
4. Redaction by default
- The observability layer (see `05-observability.md`) rejects DNA strings ≥ 20 bp, drops unallowlisted keys, and fails closed under `GENO_LEWM_REDACTION_STRICT=1` (the default).
- Crash dumps sanitize locals before write.
- VCF parsing never logs variant content.
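The first rule in the list above could be sketched as follows. The key allowlist, the nucleotide alphabet `ACGTN`, the `RedactionError` type, and the non-strict fallback of substituting a placeholder are all assumptions; the real behavior is specified in `05-observability.md`.

```python
import re

# Runs of 20 or more nucleotide letters are treated as DNA (assumed alphabet).
_DNA_RUN = re.compile(r"[ACGTN]{20,}", re.IGNORECASE)
ALLOWED_KEYS = frozenset({"event", "model_id", "duration_ms"})  # hypothetical


class RedactionError(ValueError):
    """Hypothetical error raised in strict mode when DNA reaches a log call."""


def redact_event(event: dict, strict: bool = True) -> dict:
    """Drop unallowlisted keys and reject or mask DNA-like string values."""
    clean = {}
    for key, value in event.items():
        if key not in ALLOWED_KEYS:
            continue  # unallowlisted keys never reach the sink
        if isinstance(value, str) and _DNA_RUN.search(value):
            if strict:
                raise RedactionError(f"DNA-like string in field {key!r}")
            value = "[REDACTED]"  # assumed non-strict behavior
        clean[key] = value
    return clean
```

In a real deployment the `strict` argument would presumably be derived from `GENO_LEWM_REDACTION_STRICT`; the sketch keeps it as a parameter so the logic is easy to test.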
5. Reproducible builds
- Release artifacts must be built from pinned dependency lockfiles. The CI release workflow records the build environment hash and embeds it in the manifest.
- A `--reproducible` build flag uses a deterministic source date and a fixed PYTHONHASHSEED.
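One plausible way to realize the second point is to fix the relevant environment variables before invoking the build. The sketch below assumes the "deterministic source date" maps to the widely used `SOURCE_DATE_EPOCH` convention and that the fixed hash seed is `0`; the helper name `reproducible_env` is hypothetical.

```python
import os
from typing import Mapping, Optional


def reproducible_env(source_date_epoch: int,
                     base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with build-determinism variables set."""
    env = dict(os.environ if base is None else base)
    # Timestamp that build tools embed instead of the current time (assumed).
    env["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    # Disable per-process hash randomization so iteration order is stable.
    env["PYTHONHASHSEED"] = "0"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the build step, leaving the parent process environment untouched.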
6. Signed releases
- GitHub releases must publish Sigstore-backed build provenance.
- macOS binaries are notarized and Linux binaries are signed with the project's GPG key when those packaged binaries are cut.
- The signing key fingerprints are published in `SECURITY.md` at the repo root.
7. Dependency hygiene
- Direct dependencies specify a minimum version; an upper bound is added only when required, with a justification.
- A nightly CI job runs `pip-audit` / `safety check`. Any new advisory blocks a release until triaged.
- The transitive lockfile is committed for the release process.
8. Failure containment
- Network code paths are confined to `geno_lewm/deploy/runtime.py` and `geno_lewm/cli/update.py`. A custom AST check fails CI if any other module imports `urllib`, `httpx`, `requests`, or similar.
- Disk-write paths are similarly confined.
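An import check along these lines can be written with the standard `ast` module. This is a sketch under assumptions, not the project's CI script: the banned set lists only the three modules named above, and the helper names `banned_imports` and `check_file` are hypothetical.

```python
import ast

BANNED = frozenset({"urllib", "httpx", "requests"})
ALLOWED_FILES = frozenset({
    "geno_lewm/deploy/runtime.py",
    "geno_lewm/cli/update.py",
})


def banned_imports(source: str) -> list:
    """Return every imported module whose top-level package is banned."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) cannot name a third-party package.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] in BANNED)
    return found


def check_file(rel_path: str, source: str) -> list:
    """Files on the allowlist may import network modules; others may not."""
    return [] if rel_path in ALLOWED_FILES else banned_imports(source)
```

A static check like this only sees `import` statements. Dynamic imports through `importlib` or `__import__` would need a separate rule.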
Secrets handling
- The released runtime must require no API keys. Hugging Face Hub downloads are anonymous (unauthenticated) unless the user explicitly authenticates.
- Authentication tokens, when supplied, are read from `HF_TOKEN` and never logged or stored in receipts.
- No GenoLeWM artifact contains credentials.
Cryptographic primitives
| Use | Primitive | Notes |
|---|---|---|
| Content hashes (weights, windows, manifest) | SHA-256 | from hashlib |
| Canonical JSON serialization | sorted-keys UTF-8 no-whitespace | helper in provenance/hashing.py |
No custom crypto is implemented.
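The two table rows combine into a very small amount of code. The sketch below shows one way to produce sorted-keys, no-whitespace, UTF-8 JSON with the standard library and hash it; it is not the helper in `provenance/hashing.py`, and the names `canonical_json` and `model_id_for` as well as the choice of `ensure_ascii=False` are assumptions.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys, no insignificant whitespace, UTF-8 bytes."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def model_id_for(manifest: dict) -> str:
    """Hash the canonical form so key order cannot change the identifier."""
    return hashlib.sha256(canonical_json(manifest)).hexdigest()
```

Because the keys are sorted before hashing, two manifests that differ only in key order yield the same identifier, while any change to a bound value yields a different one.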
Disclosure policy
- Security issues are reported privately via GitHub Security Advisories or `security@<project domain>` (see `SECURITY.md`).
- Acknowledgement target: 72 hours.
- Triaged-fix target: 30 days for high; 90 days for medium.
- Coordinated disclosure with an embargo window of up to 90 days.
- Public advisories follow CVE numbering when applicable.
Safety boundaries
Independent of confidentiality, the project enforces safety boundaries on use:
- The released runtime, CLI, and desktop app must surface a permanent, non-dismissible "research tool — not clinical" banner.
- The CLI emits a warning when ClinVar P/LP variants are detected, pointing to clinical follow-up.
- The license addendum (if adopted; see OQ-OVR-2) restricts germline-edit reproductive use.
Invariants
| ID | Invariant | Enforced by |
|---|---|---|
| INV-SEC-1 | No inference path calls the network after first-run setup | runtime guard + CI integration test |
| INV-SEC-2 | Every loaded weight file matches its hash in the manifest | loader call-site in deploy/runtime.py |
| INV-SEC-3 | Receipts produced for the same (model_id, input) agree on supported backends | conformance test |
| INV-SEC-4 | DNA strings of length ≥ 20 bp never appear in logs | property test |
| INV-SEC-5 | Network paths exist only in deploy/runtime.py and cli/update.py | AST check |
| INV-SEC-6 | No credentials are written to disk by the runtime | dedicated test |
| INV-SEC-7 | Personal-data fields are dropped before any non-local sink | redaction unit tests |
Open questions
| ID | Question | Owner | Target |
|---|---|---|---|
| OQ-SEC-1 | Resolved: geno_lewm.provenance is the preferred public path; the legacy import path is removed | core | done |
| OQ-SEC-2 | License addendum forbidding germline-edit reproductive use vs README-only disclaimer | core | before v0.1 tag |
| OQ-SEC-3 | Whether to provide a --audit-log mode that records all file accesses for forensic review | core | post-v1 |
| OQ-SEC-4 | Mechanism for revoking a published model_id when a serious bug is found (revocation list?) | core | before first model release |