geno_lewm.surprise.context
Context stratification labels for RFC-0009 calibration buckets.
REGION_CLASSES
module-attribute
REGION_CLASSES: tuple[str, ...] = ('coding_synonymous', 'coding_missense', 'coding_nonsense', 'splice', 'utr5', 'utr3', 'intron', 'promoter', 'enhancer', 'intergenic', 'other')
Canonical RFC-0009 region_class values.
GC_BINS
module-attribute
Canonical RFC-0009 gc_bin values.
REPEAT_CLASSES
module-attribute
REPEAT_CLASSES: tuple[str, ...] = ('none', 'simple', 'low_complexity', 'transposon', 'segmental_dup')
Canonical RFC-0009 repeat_class values.
UNKNOWN_BUCKET_ID
module-attribute
Catch-all calibration bucket, used when every more specific bucket in the backoff chain is sparse.
DEFAULT_GC_LOW_CUTOFF
module-attribute
Inclusive lower-tercile GC cutoff used when no fitted cutpoints are supplied.
DEFAULT_GC_HIGH_CUTOFF
module-attribute
Inclusive upper-tercile GC cutoff used when no fitted cutpoints are supplied.
DEFAULT_MIN_BUCKET_SIZE
module-attribute
RFC-0009 default minimum number of calibration rows for a bucket to count as well populated.
classify_context
classify_context(*, region: str | Sequence[str] | None, gc_window: str, repeat: str | Sequence[str] | None = None, low_gc_cutoff: float = DEFAULT_GC_LOW_CUTOFF, high_gc_cutoff: float = DEFAULT_GC_HIGH_CUTOFF) -> ContextLabel
Build a canonical context label from annotation terms and a DNA window.
region and repeat accept upstream annotation labels such as
VEP/SnpEff consequences or repeat-masker class strings. gc_window
is the sequence window around the variant locus.
Source code in geno_lewm/surprise/context.py
classify_region
Return the canonical region_class for annotation term(s).
classify_repeat
Return the canonical repeat_class for repeat annotation term(s).
gc_fraction
Return GC fraction over called A/C/G/T bases in sequence.
N bases are valid in reference windows but are excluded from the
denominator because their GC status is unknown. A window containing
no called bases is rejected.
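The behaviour described above can be sketched as follows. This is an illustration only, not the module's source: the name gc_fraction_sketch is hypothetical, and the case-insensitive handling, the silent skipping of characters other than A/C/G/T, and the use of ValueError are all assumptions.

```python
def gc_fraction_sketch(sequence: str) -> float:
    """Illustrative GC fraction over called A/C/G/T bases (hypothetical helper)."""
    # Assumption: lowercase (soft-masked) bases count as called bases.
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Only A/C/G/T enter the denominator; N (and, in this sketch, any
    # other character) is skipped because its GC status is unknown.
    called = gc + seq.count("A") + seq.count("T")
    if called == 0:
        # Assumption: the real function may raise a different exception type.
        raise ValueError("window contains no called A/C/G/T bases")
    return gc / called


print(gc_fraction_sketch("ACGTNNNN"))  # 0.5: the four N bases are ignored
```

Excluding N from the denominator keeps a partly uncalled window from being pulled toward a low GC fraction.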
classify_gc_bin
classify_gc_bin(sequence: str, *, low_cutoff: float = DEFAULT_GC_LOW_CUTOFF, high_cutoff: float = DEFAULT_GC_HIGH_CUTOFF) -> str
Return low, mid, or high for a DNA window's GC fraction.
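A minimal sketch of three-way GC binning, under stated assumptions: gc_bin_sketch is a hypothetical name, the cutoffs 0.40 and 0.60 are placeholders rather than the module's defaults (which are not shown on this page), and both cutoffs are treated as inclusive in the sense that a fraction equal to the low cutoff is binned low and a fraction equal to the high cutoff is binned high. The real function may resolve those boundary cases differently.

```python
def gc_bin_sketch(sequence: str, *, low_cutoff: float, high_cutoff: float) -> str:
    """Illustrative low/mid/high GC binning (hypothetical helper)."""
    if not low_cutoff < high_cutoff:
        raise ValueError("low_cutoff must be below high_cutoff")
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")  # N is not counted
    if called == 0:
        raise ValueError("window contains no called A/C/G/T bases")
    fraction = gc / called
    # Assumption: a fraction exactly on a cutoff falls into the outer bin.
    if fraction <= low_cutoff:
        return "low"
    if fraction >= high_cutoff:
        return "high"
    return "mid"


# Placeholder cutoffs, chosen only for the demonstration.
for window in ("ATATATATGC", "ACGTACGTAC", "GCGCGCGCAT"):
    print(window, gc_bin_sketch(window, low_cutoff=0.40, high_cutoff=0.60))
```

In practice the cutoffs would come from fitted tercile cutpoints or from the module defaults, not from hard-coded values like these.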
make_bucket_id
Return the stable full calibration bucket ID for a context tuple.
backoff_chain
Return fixed parent-bucket IDs ending in *.
Full buckets back off as region|gc|repeat -> region|gc ->
region -> *. Parent buckets can also be passed directly.
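The backoff order above can be sketched by dropping one trailing component at a time. This sketch assumes that bucket IDs are the components joined with "|", that the chain starts with the bucket that was passed in, and that malformed IDs are rejected with ValueError. The name backoff_chain_sketch is hypothetical, and the real function may differ on any of these points.

```python
CATCH_ALL = "*"  # the catch-all ID as written in the backoff description


def backoff_chain_sketch(bucket_id: str) -> list[str]:
    """Illustrative backoff chain for a 'region|gc|repeat' style bucket ID."""
    if bucket_id == CATCH_ALL:
        return [CATCH_ALL]
    parts = bucket_id.split("|")
    if not 1 <= len(parts) <= 3 or any(not part for part in parts):
        raise ValueError(f"malformed bucket ID: {bucket_id!r}")
    # Drop the last component at each step: region|gc|repeat, region|gc, region.
    chain = ["|".join(parts[:n]) for n in range(len(parts), 0, -1)]
    chain.append(CATCH_ALL)
    return chain


print(backoff_chain_sketch("coding_missense|high|none"))
print(backoff_chain_sketch("intron|low"))  # a parent bucket passed directly
```

Because the order is fixed, the same context always backs off through the same sequence of parents.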
select_backoff_bucket
select_backoff_bucket(label_or_bucket: ContextLabel | str, bucket_sizes: Mapping[str, int], *, min_count: int = DEFAULT_MIN_BUCKET_SIZE) -> str
Return the first bucket in the backoff chain with enough calibration rows.
If every specific parent is sparse, the catch-all * bucket is
returned. Downstream calibration code can still report low
confidence based on that bucket's own count.
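The selection rule can be sketched as a walk from the most specific bucket to the least specific one. Several details here are assumptions: the name select_backoff_bucket_sketch is hypothetical, only string bucket IDs are handled (not ContextLabel), a bucket missing from bucket_sizes counts as empty, "enough" is read as a count of at least min_count, and the value 100 is a placeholder rather than the module's default.

```python
from collections.abc import Mapping


def select_backoff_bucket_sketch(
    bucket_id: str, bucket_sizes: Mapping[str, int], *, min_count: int
) -> str:
    """Illustrative choice of the first sufficiently populated bucket."""
    parts = [] if bucket_id == "*" else bucket_id.split("|")
    # Try region|gc|repeat, then region|gc, then region.
    for n in range(len(parts), 0, -1):
        candidate = "|".join(parts[:n])
        # Assumption: absent buckets are treated as having zero rows.
        if bucket_sizes.get(candidate, 0) >= min_count:
            return candidate
    # Every specific bucket was sparse, so fall through to the catch-all
    # regardless of its own count.
    return "*"


sizes = {"intron|low|simple": 12, "intron|low": 340, "intron": 5000, "*": 90000}
print(select_backoff_bucket_sketch("intron|low|simple", sizes, min_count=100))
```

With these example counts the full bucket has only 12 rows, so the sketch settles on its first parent, intron|low.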