geno_lewm.surprise
surprise
Surprise-scoring helpers for RFC-0009.
CALIBRATION_SCHEMA_VERSION
module-attribute
On-disk calibration table schema version.
DEFAULT_CDF_POINTS
module-attribute
Number of points in each empirical CDF grid.
DEFAULT_REFERENCE_PER_BUCKET
module-attribute
Default maximum number of reference variants sampled per bucket.
LOW_CONFIDENCE_BUCKET_SIZE
module-attribute
Buckets below this size are marked low-confidence by RFC-0009.
DEFAULT_GC_HIGH_CUTOFF
module-attribute
Inclusive upper-tercile GC cutoff used when no fitted cutpoints are supplied.
DEFAULT_GC_LOW_CUTOFF
module-attribute
Inclusive lower-tercile GC cutoff used when no fitted cutpoints are supplied.
DEFAULT_MIN_BUCKET_SIZE
module-attribute
RFC-0009 default threshold for a well-populated calibration bucket.
GC_BINS
module-attribute
Canonical RFC-0009 gc_bin values.
REGION_CLASSES
module-attribute
REGION_CLASSES: tuple[str, ...] = ('coding_synonymous', 'coding_missense', 'coding_nonsense', 'splice', 'utr5', 'utr3', 'intron', 'promoter', 'enhancer', 'intergenic', 'other')
Canonical RFC-0009 region_class values.
REPEAT_CLASSES
module-attribute
REPEAT_CLASSES: tuple[str, ...] = ('none', 'simple', 'low_complexity', 'transposon', 'segmental_dup')
Canonical RFC-0009 repeat_class values.
UNKNOWN_BUCKET_ID
module-attribute
Catch-all calibration bucket reached after every parent bucket is sparse.
Aggregation
module-attribute
Supported aggregation modes for multi-step predictor outputs.
CalibrationBucket
dataclass
CalibrationExample
dataclass
One pre-scored reference variant used to build calibration CDFs.
CalibrationTable
dataclass
CalibrationTable(buckets: tuple[CalibrationBucket, ...], warnings: tuple[CalibrationWarning, ...] = (), schema_version: str = CALIBRATION_SCHEMA_VERSION)
In-memory representation of calibration.parquet.
get
Return a bucket by ID, or None if absent.
require
Return a bucket by ID, raising InputError when absent.
Source code in geno_lewm/surprise/calibration.py
resolve
resolve(label_or_bucket: str, *, min_bucket_size: int = DEFAULT_MIN_BUCKET_SIZE) -> CalibrationBucket
Resolve a sparse bucket through the table's fixed backoff chain.
Source code in geno_lewm/surprise/calibration.py
CalibrationWarning
dataclass
CalibrationWarning(bucket_id: str, resolved_bucket_id: str, n_calibration: int, min_bucket_size: int, low_confidence: bool)
Sparse-bucket warning emitted while building a calibration table.
SurpriseResult
dataclass
SurpriseResult(sigma_raw: float, sigma_calibrated: float, bucket_id: str, confidence: float, low_confidence: bool)
Calibrated surprise score for one edit.
to_dict
Return a JSON-native payload for CLI and JSONL outputs.
Source code in geno_lewm/surprise/score.py
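As an illustration of what a JSON-native payload for JSONL output can look like, here is a minimal sketch. It mirrors the documented SurpriseResult fields in a stand-in dataclass named SurpriseResultSketch (a hypothetical name); the use of dataclasses.asdict and the example values are assumptions, not the library's implementation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SurpriseResultSketch:
    """Stand-in with the same fields as the documented SurpriseResult."""

    sigma_raw: float
    sigma_calibrated: float
    bucket_id: str
    confidence: float
    low_confidence: bool

    def to_dict(self) -> dict:
        # Every field is already a float, str, or bool, so the plain
        # dataclass dict can be passed to json.dumps without conversion.
        return asdict(self)


# Made-up values, chosen only to show the shape of one JSONL row.
result = SurpriseResultSketch(
    sigma_raw=2.5,
    sigma_calibrated=0.75,
    bucket_id="intron|mid|none",
    confidence=0.9,
    low_confidence=False,
)
line = json.dumps(result.to_dict(), sort_keys=True)
print(line)
```

Each scored alternate would then occupy one such line in the output file.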
build_calibration_table
build_calibration_table(examples: Iterable[CalibrationExample], *, seed: int = 0, per_bucket_sample: int = DEFAULT_REFERENCE_PER_BUCKET, grid_size: int = DEFAULT_CDF_POINTS, min_bucket_size: int = DEFAULT_MIN_BUCKET_SIZE, low_confidence_size: int = LOW_CONFIDENCE_BUCKET_SIZE, warn_sparse: bool = True) -> CalibrationTable
Build deterministic empirical CDF buckets from pre-scored examples.
Source code in geno_lewm/surprise/calibration.py
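To make the idea of a deterministic empirical CDF grid concrete, the following standard-library sketch reduces one bucket's scores to a fixed number of quantile points. The function names, the linear interpolation, and the seeded subsampling strategy are all assumptions made for illustration; the library's actual procedure may differ.

```python
import bisect
import random


def empirical_cdf_grid(scores, grid_size, per_bucket_sample, seed=0):
    """Return grid_size evenly spaced quantiles of a seeded subsample."""
    values = list(scores)
    if not values:
        raise ValueError("need at least one score")
    if grid_size < 2:
        raise ValueError("grid_size must be at least 2")
    if len(values) > per_bucket_sample:
        # Sort before sampling so the result does not depend on input order.
        values = random.Random(seed).sample(sorted(values), per_bucket_sample)
    values.sort()
    last = len(values) - 1
    grid = []
    for i in range(grid_size):
        # Fractional index of the i-th quantile, linearly interpolated.
        position = last * i / (grid_size - 1)
        lower = int(position)
        upper = min(lower + 1, last)
        weight = position - lower
        grid.append(values[lower] * (1 - weight) + values[upper] * weight)
    return grid


def percentile_from_grid(grid, score):
    """Fraction of grid points at or below score, in [0, 1]."""
    return bisect.bisect_right(grid, score) / len(grid)


scores = [float(i) for i in range(101)]
grid = empirical_cdf_grid(scores, grid_size=5, per_bucket_sample=1000)
print(grid)  # [0.0, 25.0, 50.0, 75.0, 100.0]
print(percentile_from_grid(grid, 60.0))  # 0.6
```

Fixing the seed and sorting before sampling keeps the grid reproducible for the same set of scores, which is the property the word "deterministic" points at.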
read_calibration_table
Read and validate a calibration Parquet file.
Source code in geno_lewm/surprise/calibration.py
write_calibration_table
Write a calibration table to calibration.parquet.
Source code in geno_lewm/surprise/calibration.py
backoff_chain
Return the fixed chain of parent-bucket IDs, ending in the catch-all * bucket.
Full buckets back off as region|gc|repeat -> region|gc -> region -> *.
Parent buckets can also be passed directly.
Source code in geno_lewm/surprise/context.py
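The backoff order described above can be sketched in a few lines. This is an illustrative reimplementation, not the library code: the function name is hypothetical, and it assumes the returned chain starts with the bucket that was passed in.

```python
from __future__ import annotations

CATCH_ALL = "*"


def backoff_chain_sketch(bucket_id: str) -> list[str]:
    """List a bucket ID followed by its parents, ending in the catch-all."""
    if bucket_id == CATCH_ALL:
        return [CATCH_ALL]
    parts = bucket_id.split("|")
    # Drop the most specific field at each step:
    # region|gc|repeat -> region|gc -> region.
    chain = ["|".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + [CATCH_ALL]


print(backoff_chain_sketch("intron|mid|none"))
# ['intron|mid|none', 'intron|mid', 'intron', '*']
print(backoff_chain_sketch("intron|mid"))
# ['intron|mid', 'intron', '*']
```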
classify_context
classify_context(*, region: str | Sequence[str] | None, gc_window: str, repeat: str | Sequence[str] | None = None, low_gc_cutoff: float = DEFAULT_GC_LOW_CUTOFF, high_gc_cutoff: float = DEFAULT_GC_HIGH_CUTOFF) -> ContextLabel
Build a canonical context label from annotation terms and a DNA window.
region and repeat accept upstream annotation labels, such as VEP/SnpEff
consequence terms or RepeatMasker class strings. gc_window is the
sequence window around the variant locus.
Source code in geno_lewm/surprise/context.py
classify_gc_bin
classify_gc_bin(sequence: str, *, low_cutoff: float = DEFAULT_GC_LOW_CUTOFF, high_cutoff: float = DEFAULT_GC_HIGH_CUTOFF) -> str
Return low, mid, or high for a DNA window's GC fraction.
Source code in geno_lewm/surprise/context.py
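A minimal sketch of the three-way binning follows. Unlike classify_gc_bin, it takes an already-computed GC fraction rather than a sequence. The cutoff values 0.35 and 0.55 are placeholders invented for this example, not the values of DEFAULT_GC_LOW_CUTOFF and DEFAULT_GC_HIGH_CUTOFF, and treating both cutoffs as inclusive in the direction shown is an assumption.

```python
# Hypothetical cutoffs for illustration only; the real defaults live in
# DEFAULT_GC_LOW_CUTOFF and DEFAULT_GC_HIGH_CUTOFF and are not shown here.
EXAMPLE_LOW_CUTOFF = 0.35
EXAMPLE_HIGH_CUTOFF = 0.55


def gc_bin_sketch(
    gc: float,
    low_cutoff: float = EXAMPLE_LOW_CUTOFF,
    high_cutoff: float = EXAMPLE_HIGH_CUTOFF,
) -> str:
    """Map a GC fraction to 'low', 'mid', or 'high'."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("GC fraction must be in [0, 1]")
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be below high_cutoff")
    if gc <= low_cutoff:  # assumed: the low cutoff itself counts as 'low'
        return "low"
    if gc >= high_cutoff:  # assumed: the high cutoff itself counts as 'high'
        return "high"
    return "mid"


print([gc_bin_sketch(x) for x in (0.20, 0.35, 0.45, 0.55, 0.80)])
# ['low', 'low', 'mid', 'high', 'high']
```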
classify_region
Return the canonical region_class for annotation term(s).
Source code in geno_lewm/surprise/context.py
classify_repeat
Return the canonical repeat_class for repeat annotation term(s).
Source code in geno_lewm/surprise/context.py
gc_fraction
Return GC fraction over called A/C/G/T bases in sequence.
N bases are valid in reference windows but are excluded from the
denominator because their GC status is unknown. A window containing
no called bases is rejected.
Source code in geno_lewm/surprise/context.py
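The rule above (N bases excluded from the denominator, empty windows rejected) could be sketched as follows. The function name is hypothetical, case-insensitive matching is an assumption, and ValueError stands in for whatever exception the library actually raises.

```python
def gc_fraction_sketch(sequence: str) -> float:
    """Illustrative GC fraction over called bases only (not the library code)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    if called == 0:
        # Mirrors the documented rule that a window with no called
        # bases is rejected; the exception type here is an assumption.
        raise ValueError("window contains no called A/C/G/T bases")
    return gc / called


print(gc_fraction_sketch("GCNNAT"))  # 0.5: the two N bases are ignored
```

Counting the N bases in the denominator would have given 2/6 instead of 2/4, biasing N-rich windows toward the low GC bin.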
make_bucket_id
Return the stable full calibration bucket ID for a context tuple.
Source code in geno_lewm/surprise/context.py
select_backoff_bucket
select_backoff_bucket(label_or_bucket: ContextLabel | str, bucket_sizes: Mapping[str, int], *, min_count: int = DEFAULT_MIN_BUCKET_SIZE) -> str
Return the first bucket in the backoff chain with enough calibration rows.
If every specific parent is sparse, the catch-all * bucket is
returned. Downstream calibration code can still report low
confidence based on that bucket's own count.
Source code in geno_lewm/surprise/context.py
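The selection rule can be illustrated with a small standalone sketch. It takes the backoff chain as an explicit argument to stay self-contained; the function name, the bucket counts, and the threshold of 100 are invented for the example, and reading "enough" as count >= min_count is an assumption.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def select_bucket_sketch(
    chain: Sequence[str],
    bucket_sizes: Mapping[str, int],
    min_count: int,
) -> str:
    """Return the first bucket in chain with at least min_count rows."""
    for bucket_id in chain:
        # Buckets missing from the mapping are treated as empty.
        if bucket_sizes.get(bucket_id, 0) >= min_count:
            return bucket_id
    # Every specific parent was sparse: fall through to the catch-all.
    return "*"


chain = ["intron|mid|none", "intron|mid", "intron", "*"]
sizes = {"intron|mid|none": 12, "intron|mid": 40, "intron": 900, "*": 5000}
print(select_bucket_sketch(chain, sizes, min_count=100))  # intron

sparse = {"intron|mid|none": 3, "intron|mid": 5, "intron": 8, "*": 20}
print(select_bucket_sketch(chain, sparse, min_count=100))  # *
```

In the second call the catch-all is returned even though its own count is below the threshold, which is why the caller may still need to flag the result as low confidence.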
score_variant
score_variant(variant: EditSpec, encoder: object, action_encoder: object, predictor: object, calibration: CalibrationTable, *, reference_window: str, window_start_bp: int = 0, region: str | Sequence[str] | None = None, repeat: str | Sequence[str] | None = None, aggregation: str = 'mean', min_bucket_size: int = DEFAULT_MIN_BUCKET_SIZE) -> SurpriseResult
Score one edit against a caller-supplied reference window.
The scorer is intentionally model-object agnostic: callers can pass
the concrete training-time modules or small deterministic fakes.
FASTA-backed window extraction is available through score_vcf;
checkpoint loading is owned by higher runtime layers.
Source code in geno_lewm/surprise/score.py
score_vcf
score_vcf(vcf_path: str | Path, encoder: object, action_encoder: object, predictor: object, calibration: CalibrationTable, output_path: str | Path, *, reference_windows: Mapping[str, str] | None = None, reference_fasta: str | Path | None = None, window_bp: int = DEFAULT_WINDOW_BP, window_start_bp: int = 0, region: str | Sequence[str] | None = None, repeat: str | Sequence[str] | None = None, aggregation: str = 'mean', show_progress: bool = True, batch_size: int = 64, min_bucket_size: int = DEFAULT_MIN_BUCKET_SIZE) -> Path
Score VCF rows and write one JSON object per scored alternate.
Pass reference_fasta for local FASTA-backed window extraction.
reference_windows remains useful for tests and already-extracted
windows. Mapping keys are tried in this order:
chrom:pos:ref:alt, chrom:pos, then chrom.
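The key lookup order can be sketched as below. The helper name and the example windows are hypothetical, and returning None when no key matches is an assumption; the library may raise an error instead.

```python
from __future__ import annotations

from typing import Mapping


def lookup_window_sketch(
    windows: Mapping[str, str],
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
) -> str | None:
    """Try the most specific key first, then progressively broader ones."""
    candidates = (
        f"{chrom}:{pos}:{ref}:{alt}",  # exact variant
        f"{chrom}:{pos}",              # any variant at this position
        chrom,                         # chromosome-wide fallback
    )
    for key in candidates:
        if key in windows:
            return windows[key]
    return None  # assumed behaviour when nothing matches


windows = {"chr1": "ACGTACGTACGTACGT", "chr1:100": "GGGGCCCC"}
print(lookup_window_sketch(windows, "chr1", 100, "G", "A"))  # GGGGCCCC
print(lookup_window_sketch(windows, "chr1", 200, "C", "T"))  # ACGTACGTACGTACGT
print(lookup_window_sketch(windows, "chr2", 100, "G", "A"))  # None
```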