geno_lewm.data.clinvar
ClinVar local VCF preparation and shard loading.
ClinvarVariant (dataclass)
ClinvarVariant(chrom: str, pos: int, ref: str, alt: str, clinical_significance: str, review_status: str, gene_symbol: str | None, clinvar_id: int, schema_version: str = CLINVAR_SCHEMA_VERSION)
One normalized ClinVar row.
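As a rough illustration of the row shape, the sketch below mirrors the constructor signature above in a standalone dataclass. It is not the library's class: `ClinvarVariantSketch` is a hypothetical name, the value given to `CLINVAR_SCHEMA_VERSION` is a placeholder (the real value is not shown in this reference), and the field values in the example row are made up.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Placeholder only: the real CLINVAR_SCHEMA_VERSION value is not documented here.
CLINVAR_SCHEMA_VERSION = "0"


@dataclass
class ClinvarVariantSketch:
    """Standalone mirror of the documented fields (hypothetical class)."""

    chrom: str
    pos: int
    ref: str
    alt: str
    clinical_significance: str
    review_status: str
    gene_symbol: Optional[str]  # written as `str | None` in the signature above
    clinvar_id: int
    schema_version: str = CLINVAR_SCHEMA_VERSION


# Illustrative values, not real ClinVar data.
row = ClinvarVariantSketch(
    chrom="1",
    pos=12345,
    ref="A",
    alt="G",
    clinical_significance="PATHOGENIC",
    review_status="criteria_provided",
    gene_symbol=None,
    clinvar_id=1,
)
row_as_dict = asdict(row)  # plain dict, convenient for tabular writers
```

Only `schema_version` has a default, so every other field must be passed explicitly, including `gene_symbol` when no gene is known.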
ClinvarPrepareReport (dataclass)
ClinvarPrepareReport(output_path: Path, release: str, records_read: int, allele_records_seen: int, records_written: int, skipped_allele: int, size_bytes: int, already_exists: bool = False)
Summary emitted by geno-lewm-prepare-clinvar.
prepare_clinvar_shard
prepare_clinvar_shard(input_vcf: str | Path, output_dir: str | Path, *, release: str, max_allele_len: int = 16, overwrite: bool = False) -> ClinvarPrepareReport
Normalize a local ClinVar VCF/VCF.gz into the release shard schema.
Source code in geno_lewm/data/clinvar.py
iter_clinvar_vcf_variants
iter_clinvar_vcf_variants(input_vcf: str | Path, *, max_allele_len: int = 16) -> Iterator[ClinvarVariant]
Yield normalized ClinVar rows from a local VCF without writing a shard.
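To make the idea of yielding rows from a VCF concrete, here is a minimal standard-library sketch of one way such an iterator could work. It is not the implementation in `geno_lewm`: the names `Row`, `parse_info`, and `iter_vcf_lines` are hypothetical, the rule that `max_allele_len` skips any record whose REF or ALT is longer than the limit is an assumption, and the sketch passes `CLNSIG` and `CLNREVSTAT` through unchanged rather than guessing at the library's normalization. The INFO keys `CLNSIG`, `CLNREVSTAT`, and `GENEINFO` are the ones used in ClinVar's VCF releases; the sample lines are invented.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical stand-in for one parsed ClinVar record."""

    chrom: str
    pos: int
    ref: str
    alt: str
    clinical_significance: str
    review_status: str
    gene_symbol: Optional[str]
    clinvar_id: int


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value mapping."""
    tags: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        tags[key] = value
    return tags


def iter_vcf_lines(lines: Iterable[str], max_allele_len: int = 16) -> Iterator[Row]:
    """Yield one Row per ALT allele, skipping over-long alleles (assumed rule)."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or not fields[2].isdigit():
            continue  # malformed record or non-numeric ID
        chrom, pos, vid, ref, alts, _qual, _filter, info = fields[:8]
        tags = parse_info(info)
        gene = tags.get("GENEINFO")
        # GENEINFO looks like "SYMBOL:id|SYMBOL2:id2"; keep the first symbol.
        gene_symbol = gene.split("|")[0].split(":")[0] if gene else None
        for alt in alts.split(","):
            if alt in ("", "."):
                continue  # no alternate allele given
            if len(ref) > max_allele_len or len(alt) > max_allele_len:
                continue  # allele exceeds the length limit
            yield Row(
                chrom,
                int(pos),
                ref,
                alt,
                tags.get("CLNSIG", ""),
                tags.get("CLNREVSTAT", ""),
                gene_symbol,
                int(vid),
            )


# Invented sample lines for illustration only.
SAMPLE = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t11\tA\tG\t.\t.\tCLNSIG=Pathogenic;CLNREVSTAT=criteria_provided;GENEINFO=GENE1:1\n",
    "2\t200\t22\tA\t" + "T" * 20 + "\t.\t.\tCLNSIG=Benign\n",
]
rows = list(iter_vcf_lines(SAMPLE))
```

Because the function is a generator, rows are produced lazily, which matches the documented intent of inspecting a VCF without writing a shard. For a `.vcf.gz` input, the lines could come from `gzip.open(path, "rt")`.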
iter_clinvar_shard
Yield normalized ClinVar rows from a Parquet shard.
label_set
Return ClinVar rows usable for labelled eval, excluding VUS/OTHER.
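The exclusion described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the signature of `label_set` is not shown in this reference, so the helper below uses a hypothetical name (`keep_labelled`), a hypothetical row type, and assumes the normalized `clinical_significance` field holds the literal strings `VUS` and `OTHER` for the excluded classes. The sample rows are invented.

```python
from typing import Iterable, Iterator, NamedTuple

# Assumed label names for the excluded classes.
EXCLUDED_LABELS = frozenset({"VUS", "OTHER"})


class LabelledRow(NamedTuple):
    """Hypothetical minimal row carrying only what the filter needs."""

    clinvar_id: int
    clinical_significance: str


def keep_labelled(rows: Iterable[LabelledRow]) -> Iterator[LabelledRow]:
    """Yield rows whose label is not in EXCLUDED_LABELS (case-insensitive)."""
    for row in rows:
        if row.clinical_significance.upper() not in EXCLUDED_LABELS:
            yield row


# Invented rows for illustration only.
sample_rows = [
    LabelledRow(1, "PATHOGENIC"),
    LabelledRow(2, "VUS"),
    LabelledRow(3, "BENIGN"),
    LabelledRow(4, "OTHER"),
]
labelled = list(keep_labelled(sample_rows))
```

Dropping ambiguous classes before evaluation leaves only rows with a definite label, which is what a labelled benchmark needs.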