geno_lewm.action.synthetic
Synthetic edit samplers (RFC-0003 §3.8).
These samplers produce RelEdit objects keyed to an existing
window string. The training data pipeline (RFC-0006 §3.4) uses them to
ensure uniform action-space coverage when natural variants are sparse
in a given region.
All samplers are deterministic with respect to a seeded
random.Random instance (passed in as rng), so training
runs are reproducible end-to-end (RFC-0005 §3.6).
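The reproducibility property can be illustrated with a minimal sketch. The helper draw_positions below is a hypothetical stand-in for a sampler, not part of this module; the only thing it shares with the real samplers is that all randomness comes from the rng argument.

```python
import random


def draw_positions(n: int, length: int, rng: random.Random) -> list[int]:
    """Hypothetical stand-in for a sampler: every draw goes through rng."""
    return [rng.randrange(length) for _ in range(n)]


# Two generators built from the same seed produce the same sequence of draws,
# so a sampling run can be replayed exactly.
first = draw_positions(5, 1000, random.Random(1234))
second = draw_positions(5, 1000, random.Random(1234))
assert first == second
```

Passing the generator explicitly, rather than relying on the module-level random state, keeps the result independent of any other code that draws random numbers in the same process.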
A minimum distance from each window edge is enforced (edge_margin,
default 64 bp). This guarantees the pooling step has enough context on
both sides of the edit, matching the encoder's pooling assumptions
(RFC-0002 §3.4).
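As a rough illustration of the margin rule, the sketch below computes which 0-based positions stay at least edge_margin bases away from both window edges. The helper name valid_positions and the exact boundary convention (half-open range, error on short windows) are assumptions made for this example, not the module's confirmed behaviour.

```python
def valid_positions(window_len: int, edge_margin: int = 64) -> range:
    """Return the 0-based positions with at least edge_margin bases on each side.

    Hypothetical helper: the real module may use a different boundary convention.
    """
    if window_len <= 2 * edge_margin:
        raise ValueError("window too short for the requested edge margin")
    return range(edge_margin, window_len - edge_margin)


positions = valid_positions(200)
# First eligible position is 64 (64 bases to its left),
# last eligible position is 135 (64 bases to its right).
assert positions[0] == 64
assert positions[-1] == 135
assert len(positions) == 72
```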
uniform_snv
uniform_snv(window: str, n: int, *, rng: Random, edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n uniform SNVs anchored inside window.
Each SNV's alt is uniformly drawn from the three non-reference
bases at the chosen position, so the contract "alt is always
non-reference" is enforced by construction.
Returns edits in the order they were sampled. The list may contain several edits at the same position; the caller (the data pipeline) is responsible for deduplication if it needs disjoint edits.
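The behaviour described above can be sketched without the library. In the example below, Snv is a hypothetical stand-in for RelEdit, and sketch_uniform_snv and dedupe_by_position are illustrative names. The sketch assumes the window contains only A, C, G and T and that positions are 0-based; neither is confirmed by this page.

```python
import random
from typing import NamedTuple

BASES = "ACGT"


class Snv(NamedTuple):
    """Hypothetical stand-in for RelEdit, holding only what the sketch needs."""
    pos: int
    ref: str
    alt: str


def sketch_uniform_snv(window: str, n: int, *, rng: random.Random,
                       edge_margin: int = 64) -> list[Snv]:
    lo, hi = edge_margin, len(window) - edge_margin
    if lo >= hi:
        raise ValueError("window too short for the requested edge margin")
    edits = []
    for _ in range(n):
        pos = rng.randrange(lo, hi)
        ref = window[pos]
        # Drawing from the bases that differ from ref makes alt != ref hold
        # by construction, with no rejection loop.
        alt = rng.choice([b for b in BASES if b != ref])
        edits.append(Snv(pos, ref, alt))
    return edits


def dedupe_by_position(edits: list[Snv]) -> list[Snv]:
    """Caller-side deduplication: keep the first edit seen at each position."""
    seen: set[int] = set()
    unique = []
    for edit in edits:
        if edit.pos not in seen:
            seen.add(edit.pos)
            unique.append(edit)
    return unique


window = "ACGT" * 50  # 200 bp toy window
edits = sketch_uniform_snv(window, 50, rng=random.Random(0))
disjoint = dedupe_by_position(edits)
```

Keeping the first edit at each position preserves sampling order, which matches the ordering guarantee stated above; a pipeline with different needs could pick another tie-breaking rule.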
Source code in geno_lewm/action/synthetic.py
indel
indel(window: str, n: int, *, rng: Random, length_dist: Mapping[int, float] | Sequence[float] | None = None, type_mix: tuple[float, float] = (0.5, 0.5), edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n indels (INS or DEL).
length_dist is the distribution over event length (the number of
bases inserted or deleted, excluding the VCF anchor base). The default
is a truncated geometric over [1, V1_MAX_LEN-1].
type_mix is (p_ins, p_del). Default 50/50.
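To make the two parameters concrete, here is a minimal sketch of a truncated geometric length distribution and of drawing an event type and length from it. The function names, the success probability p = 0.5, and the max_len value used in the usage lines are assumptions for illustration; this page gives neither the geometric parameter nor the value of V1_MAX_LEN.

```python
import random


def truncated_geometric(max_len: int, p: float = 0.5) -> dict[int, float]:
    """P(k) proportional to (1 - p)**(k - 1) * p for k in [1, max_len], renormalised."""
    weights = {k: (1 - p) ** (k - 1) * p for k in range(1, max_len + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def sample_indel_shape(rng: random.Random, length_dist: dict[int, float],
                       type_mix: tuple[float, float] = (0.5, 0.5)) -> tuple[str, int]:
    """Draw (type, length); type_mix is (p_ins, p_del) as in the signature above."""
    kind = rng.choices(["INS", "DEL"], weights=type_mix, k=1)[0]
    lengths = list(length_dist)
    probs = [length_dist[k] for k in lengths]
    length = rng.choices(lengths, weights=probs, k=1)[0]
    return kind, length


dist = truncated_geometric(max_len=8)  # 8 is an arbitrary illustrative bound
kind, length = sample_indel_shape(random.Random(7), dist)
```

Renormalising after truncation keeps the probabilities summing to one while preserving the geometric shape, so short events remain the most likely.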
Source code in geno_lewm/action/synthetic.py
mnv
mnv(window: str, n: int, *, rng: Random, length_dist: Mapping[int, float] | Sequence[float] | None = None, edge_margin: int = DEFAULT_EDGE_MARGIN) -> list[RelEdit]
Sample n MNVs (length-preserving multi-base substitutions).
Length is drawn from length_dist (default uniform over [2, 8]
per RFC text). The alt is guaranteed different from ref at every
base (otherwise constructing a RelEdit with that ref/alt would be
rejected by EditSpec validation).
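The per-base guarantee can be sketched as follows. The names sketch_mnv and sketch_mnv_alt are hypothetical, the result is a plain tuple rather than a RelEdit, and keeping the whole event inside the margin on both sides is an assumption about how edge_margin applies to multi-base edits.

```python
import random

BASES = "ACGT"


def sketch_mnv_alt(ref: str, rng: random.Random) -> str:
    """Build an alt of the same length that differs from ref at every base."""
    return "".join(rng.choice([b for b in BASES if b != r]) for r in ref)


def sketch_mnv(window: str, *, rng: random.Random, min_len: int = 2,
               max_len: int = 8, edge_margin: int = 64) -> tuple[int, str, str]:
    """Return (pos, ref, alt) for one MNV with a uniformly drawn length."""
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    lo = edge_margin
    hi = len(window) - edge_margin - length  # last start that keeps the event inside
    if hi < lo:
        raise ValueError("window too short for the requested edge margin")
    pos = rng.randint(lo, hi)
    ref = window[pos:pos + length]
    return pos, ref, sketch_mnv_alt(ref, rng)


window = "ACGT" * 50  # 200 bp toy window
rng = random.Random(3)
samples = [sketch_mnv(window, rng=rng) for _ in range(20)]
```

Choosing each alt base from the three bases that differ from the reference base at that offset satisfies the constraint by construction, in the same way as the SNV sampler.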