Self-supervised JEPA training & inference in your browser, using the deterministic CPU-backed WASM path today
               ┌────────────────┐
x_context ────►│    Context     │
               │  Encoder (θ)   ├─► s_x ──┐
               └────────────────┘         │
                                          ▼
                                     ┌──────────┐
                     z (opt.) ──────►│          │
                                     │ Predictor├─► ŝ_y ──┐
             target_positions ──────►│          │         │
                                     └──────────┘         │   ┌──────────┐
                                                          ├──►│ EnergyFn │─► loss
               ┌────────────────┐                         │   └──────────┘
x_target ─────►│     Target     │                         │
               │  Encoder (ξ)   ├─► s_y ──────────────────┘
               └────────────────┘
                       ↑
                       │ EMA(θ → ξ)
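The data flow in the diagram can be sketched as one generic function. This is only an illustration: `jepa_loss` and its closure parameters are hypothetical names, not the crate's API, and the optional latent `z` is left out for brevity.

```rust
/// Hypothetical sketch of the forward pass in the diagram (not the crate's API).
/// Each stage is passed in as a closure so the wiring stays visible.
pub fn jepa_loss<E, T, P, F>(
    x_context: &[f32],
    x_target: &[f32],
    target_positions: &[usize],
    context_encoder: E,
    target_encoder: T,
    predictor: P,
    energy: F,
) -> f32
where
    E: Fn(&[f32]) -> Vec<f32>,
    T: Fn(&[f32]) -> Vec<f32>,
    P: Fn(&[f32], &[usize]) -> Vec<f32>,
    F: Fn(&[f32], &[f32]) -> f32,
{
    // s_x: context representation (gradients would flow through this path).
    let s_x = context_encoder(x_context);
    // s_y: target representation (EMA weights, no gradients in a real setup).
    let s_y = target_encoder(x_target);
    // ŝ_y: prediction of the target representation at the requested positions.
    let s_y_hat = predictor(&s_x, target_positions);
    // The loss is the energy between prediction and target representation.
    energy(&s_y_hat, &s_y)
}
```

In a real training loop, only the context encoder and predictor would receive gradients from this loss; the target encoder would be refreshed by the EMA step.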
Context encoder: a Vision Transformer (ViT) that encodes the visible patches, with gradients flowing through it. It sees only unmasked context tokens; target patches are removed before self-attention.
crates/jepa-vision/src/vit.rs → VitEncoder
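Removing target patches before self-attention amounts to gathering only the context tokens, so the encoder's sequence length shrinks. A minimal sketch, assuming tokens are stored as one `Vec<f32>` per patch; `keep_context_tokens` is a hypothetical helper, not a function in `vit.rs`.

```rust
/// Hypothetical helper: keep only the tokens at `context_indices`, so the
/// encoder never attends to masked target patches.
pub fn keep_context_tokens(
    patch_tokens: &[Vec<f32>],
    context_indices: &[usize],
) -> Vec<Vec<f32>> {
    context_indices
        .iter()
        .map(|&i| patch_tokens[i].clone()) // panics if an index is out of range
        .collect()
}
```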
Target encoder: same ViT architecture as the context encoder, but its weights are updated as an Exponential Moving Average (EMA) of the context encoder's weights, so no gradients flow through this path.
crates/jepa-core/src/ema.rs → Ema
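The EMA update itself is a single interpolation per weight, ξ ← m·ξ + (1 − m)·θ. The sketch below works on flat `f32` slices; `ema_update` and its `momentum` parameter are illustrative assumptions and may differ from the `Ema` type's actual interface.

```rust
/// Hypothetical helper: in-place EMA update of target weights (ξ) toward the
/// online/context weights (θ): ξ ← m·ξ + (1 − m)·θ.
pub fn ema_update(target: &mut [f32], online: &[f32], momentum: f32) {
    assert_eq!(target.len(), online.len(), "parameter counts must match");
    assert!((0.0..=1.0).contains(&momentum), "momentum must lie in [0, 1]");
    for (xi, theta) in target.iter_mut().zip(online) {
        *xi = momentum * *xi + (1.0 - momentum) * *theta;
    }
}
```

A momentum close to 1 makes the target encoder change slowly; a momentum of exactly 1 freezes it.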
Predictor: a narrow transformer that predicts target representations from context embeddings using position-conditioned prediction tokens.
crates/jepa-vision/src/image.rs → TransformerPredictor
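One common way to build position-conditioned prediction tokens is to add a positional embedding to a shared mask token for each target position and append the results to the context sequence. The sketch below assumes that scheme; whether `TransformerPredictor` does exactly this is not confirmed by the text, and `build_predictor_input` is a hypothetical name.

```rust
/// Hypothetical sketch: predictor input = context tokens followed by one
/// query token per target position (shared mask token + positional embedding).
pub fn build_predictor_input(
    context_tokens: &[Vec<f32>],
    mask_token: &[f32],
    pos_embed: &[Vec<f32>],
    target_positions: &[usize],
) -> Vec<Vec<f32>> {
    let mut seq = context_tokens.to_vec();
    for &pos in target_positions {
        // The positional embedding tells the predictor which patch to predict.
        let query: Vec<f32> = mask_token
            .iter()
            .zip(&pos_embed[pos])
            .map(|(m, p)| m + p)
            .collect();
        seq.push(query);
    }
    seq
}
```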
Masking: generates contiguous rectangular blocks of masked target patches on the 2D patch grid, ensuring that context and target tokens are disjoint.
crates/jepa-core/src/masking.rs → BlockMasking
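The disjoint split can be illustrated with already-chosen blocks; how `BlockMasking` samples block positions and sizes is not shown here. `Block` and `split_by_blocks` are hypothetical names, and patch indices are assumed to be row-major.

```rust
/// Hypothetical block description: (top, left, height, width) in patch units.
pub type Block = (usize, usize, usize, usize);

/// Splits a `grid_h` x `grid_w` patch grid into disjoint (context, target)
/// index lists. Indices are row-major: idx = row * grid_w + col.
pub fn split_by_blocks(
    grid_h: usize,
    grid_w: usize,
    blocks: &[Block],
) -> (Vec<usize>, Vec<usize>) {
    let mut is_target = vec![false; grid_h * grid_w];
    for &(top, left, h, w) in blocks {
        // Clamp each block to the grid so oversized blocks cannot panic.
        for row in top..(top + h).min(grid_h) {
            for col in left..(left + w).min(grid_w) {
                is_target[row * grid_w + col] = true;
            }
        }
    }
    let mut context = Vec::new();
    let mut target = Vec::new();
    for (idx, &masked) in is_target.iter().enumerate() {
        if masked {
            target.push(idx);
        } else {
            context.push(idx);
        }
    }
    (context, target)
}
```

Because every patch index lands in exactly one of the two lists, context and target are disjoint by construction, even when blocks overlap each other.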
Energy function: measures prediction quality in representation space. L2, cosine, and smooth L1 distances are supported.
crates/jepa-core/src/energy.rs → L2Energy
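The three distances can be sketched on flat vectors as follows. The reductions (mean over elements), the `1 - cos` convention, and the `beta` threshold are assumptions for illustration; the crate's `L2Energy` and related types may normalize differently.

```rust
/// Mean squared difference between prediction and target (assumed reduction).
pub fn l2_energy(pred: &[f32], target: &[f32]) -> f32 {
    assert_eq!(pred.len(), target.len());
    let sum: f32 = pred.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum();
    sum / pred.len() as f32
}

/// Cosine energy, taken here as 1 - cosine similarity (0 when aligned).
pub fn cosine_energy(pred: &[f32], target: &[f32]) -> f32 {
    assert_eq!(pred.len(), target.len());
    let dot: f32 = pred.iter().zip(target).map(|(p, t)| p * t).sum();
    let norm_p = pred.iter().map(|p| p * p).sum::<f32>().sqrt();
    let norm_t = target.iter().map(|t| t * t).sum::<f32>().sqrt();
    // Small epsilon avoids division by zero for all-zero vectors.
    1.0 - dot / (norm_p * norm_t + 1e-8)
}

/// Smooth L1 (Huber-style): quadratic below `beta`, linear above it.
pub fn smooth_l1_energy(pred: &[f32], target: &[f32], beta: f32) -> f32 {
    assert_eq!(pred.len(), target.len());
    let sum: f32 = pred
        .iter()
        .zip(target)
        .map(|(p, t)| {
            let d = (p - t).abs();
            if d < beta { 0.5 * d * d / beta } else { d - 0.5 * beta }
        })
        .sum();
    sum / pred.len() as f32
}
```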
Collapse regularizer: prevents representation collapse via VICReg or Barlow Twins loss terms that encourage variance and decorrelation.
crates/jepa-core/src/collapse.rs → VICReg
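The variance and decorrelation terms can be sketched in the style of VICReg: a hinge on each dimension's standard deviation plus a penalty on off-diagonal covariance. `vicreg_terms`, `gamma`, and `eps` are illustrative names, and the crate's `VICReg` type may weight or combine the terms differently.

```rust
/// Hypothetical sketch of the VICReg variance and covariance terms for a
/// batch `z` of N embeddings with D dimensions each.
/// Returns (variance_loss, covariance_loss).
pub fn vicreg_terms(z: &[Vec<f32>], gamma: f32, eps: f32) -> (f32, f32) {
    let n = z.len();
    assert!(n > 1, "need at least two samples");
    let d = z[0].len();
    assert!(d > 0 && z.iter().all(|row| row.len() == d));

    // Per-dimension mean over the batch.
    let mut mean = vec![0.0_f32; d];
    for row in z {
        for (m, &v) in mean.iter_mut().zip(row) {
            *m += v / n as f32;
        }
    }

    // Unbiased covariance matrix (divides by N - 1).
    let mut cov = vec![vec![0.0_f32; d]; d];
    for row in z {
        for i in 0..d {
            for j in 0..d {
                cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n as f32 - 1.0);
            }
        }
    }

    let mut variance_loss = 0.0_f32;
    let mut covariance_loss = 0.0_f32;
    for i in 0..d {
        // Hinge: penalize dimensions whose std falls below gamma.
        let std = (cov[i][i] + eps).sqrt();
        variance_loss += (gamma - std).max(0.0) / d as f32;
        for j in 0..d {
            if i != j {
                // Penalize correlation between different dimensions.
                covariance_loss += cov[i][j] * cov[i][j] / d as f32;
            }
        }
    }
    (variance_loss, covariance_loss)
}
```

A fully collapsed batch (all embeddings identical) maximizes the variance term, while perfectly correlated dimensions drive up the covariance term.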