The earlier EUPE ONNX export used in this deck was broken and produced misleading geometry claims.
The corrected export and refreshed compare artifacts now live in demo/reports/eupe-vs-ssl-reference.html and demo/reports/eupe-compare.json.
Open sample compare report · Browse reports
The deck below has been rewritten around the corrected benchmark and revised interpretation.
A deep dive into self-supervised vision model representations
latent-inspector | Rust + ONNX Runtime
github.com/AbdelStark/latent-inspector
Act I
One photograph. 224 x 224 pixels. Three color channels. Every model sees the same pixels.
Top 3 PCA components mapped to RGB
Act II
Learning to see without being told what to look at
Supervised
Human labels for every image
ImageNet: 14M images, years of work
Millions of dollars
Self-Supervised
No labels needed
The internet: billions of images
Zero annotation cost
The trick: invent a task that requires no labels
but forces the model to understand the image's structure.
Self-distillation: "Two views of the same image. Produce the same representation for both."
Latent prediction: "I masked part of the image. Predict the representation of the hidden part."
Video prediction: "Predict the representation of the next frame."
Proxy distillation: "First build one large proxy teacher from multiple experts, then compress that proxy into a small generalist student."
The Cast
DINOv2 ViT-L/14 · 304M · 1024-dim
Self-distillation. Student matches a slowly-evolving teacher across augmented views.
The model learns that objects are the things that stay stable when everything else changes.
Meta FAIR · Oquab et al., 2023
I-JEPA ViT-H/14 · 632M · 1280-dim
Masks large image regions. Predicts the representation of missing patches, not pixels.
Predicts meaning, not appearance. Every patch must encode unique spatial context.
Meta FAIR · Assran et al., 2023 · Yann LeCun's JEPA
V-JEPA 2 ViT-L/16 · 304M · 1024-dim
Video prediction in latent space. Predicts future frames, not pixels.
Even on a static photo, carries an implicit prior about motion and time.
Meta FAIR · Bardes et al., 2025
EUPE ViT-B/16 · 86M · 768-dim
Proxy-distilled from a 1.9B universal teacher that aggregates multiple specialist teachers.
Compact generalist. Lower-rank, more top-heavy, and more locally coherent than the SSL-only models.
Meta FAIR · Zhu et al., 2026
The model families differ in architecture and scale too, but the central point still holds: training pressure reshapes representation geometry.
Let's measure how that single choice
reshapes the entire geometry of the representation.
Act III
cargo install latent-inspector
Rust single binary, no Python env
ONNX Runtime real inference, verified models
Validated SHA-256 checksums, golden references
compare cross-model metrics + matrices
inspect single-model deep diagnostics
tui interactive terminal dashboard
validate model integrity verification
Act IV
Image 224 x 224 px, 3 channels
↓ split into 14 x 14 pixel patches
256 patches
↓ linear projection to 1024 dimensions
256 vectors x 1024 numbers
↓ 24 Transformer layers (self-attention + FFN)
256 refined vectors = the representation
262,144 floating-point numbers. That's what we analyze.
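The tokenization arithmetic above can be checked directly. This is a standalone sketch of the counting, not latent-inspector's code:

```rust
// Standalone sketch of the ViT-L/14 tokenization arithmetic; not the tool's code.
fn token_count(img: usize, patch: usize) -> usize {
    let per_side = img / patch; // 224 / 14 = 16 patches per side
    per_side * per_side
}

fn main() {
    let patches = token_count(224, 14); // 16 x 16 = 256 patch tokens
    let dim = 1024;
    // 256 tokens x 1024 dims = 262,144 floats to analyze
    println!("{} patches x {} dims = {} floats", patches, dim, patches * dim);
}
```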
1024 dimensions is too many to visualize. PCA finds the directions of maximum variation.
Map the top 3 directions to Red, Green, Blue channels.
Same-colored regions = the model considers those patches similar.
Colors are relative to each model. You cannot compare "red" across models.
Act V
DINOv2
Sharp boundaries. The elephant body clusters in one color. Background in another.
Self-distillation forces consistency across augmented views. The most consistent thing across crops, rotations, and color shifts is the object itself.
I-JEPA
More colors. Trunk, legs, ears each distinct. Background has spatial structure.
The prediction objective forces each patch to be unique. If adjacent patches had identical representations, the model couldn't predict which one is missing.
V-JEPA 2
Much cleaner local continuity once the still image is adapted through the 16-frame evaluation path.
The corrected image wrapper keeps the video prior, but it now looks like a coherent image representation rather than a distorted 2-frame surrogate.
EUPE
Sharper grouping and much stronger local agreement than the other three models.
The corrected export still shows a compressed representation, but it remains clearly image-dependent and structurally meaningful.
Let's quantify exactly how different.
Act VI
How many dimensions the model actually uses. Like a 1024-channel mixing board — how many channels carry signal?
| Model | Effective Rank | Ambient Dims |
|---|---|---|
| DINOv2 | 60 | 1024 |
| I-JEPA | 44 | 1280 |
| V-JEPA 2 | 51 | 1024 |
| EUPE | 22 | 768 |
DINOv2 spreads its signal across the most dimensions. EUPE remains the most concentrated model in this four-model set.
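Effective rank has a standard closed form (Roy and Vetterli, 2007): the exponential of the Shannon entropy of the normalized variance spectrum. A minimal Rust sketch, with illustrative spectra rather than the measured model values:

```rust
/// Effective rank: exp of the Shannon entropy of the normalized variance
/// spectrum (Roy & Vetterli, 2007). A flat spectrum over k components gives
/// effective rank k; a single dominant component gives a value near 1.
fn effective_rank(variances: &[f64]) -> f64 {
    let total: f64 = variances.iter().sum();
    let entropy: f64 = variances
        .iter()
        .filter(|&&v| v > 0.0)
        .map(|&v| {
            let p = v / total;
            -p * p.ln()
        })
        .sum();
    entropy.exp()
}

fn main() {
    // Illustrative spectra, not the measured model values.
    println!("{:.1}", effective_rank(&[1.0; 60]));         // flat spectrum: 60.0
    println!("{:.1}", effective_rank(&[100.0, 1.0, 1.0])); // top-heavy: ~1.1
}
```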
How differentiated are the patches? High = every patch says something unique.
| Model | Entropy | Note |
|---|---|---|
| DINOv2 | 2.52 | |
| I-JEPA | 2.89 | every patch is unique |
| V-JEPA 2 | 2.89 | high variation once the image path is corrected |
| EUPE | 2.83 | compact, still differentiated |
I-JEPA must differentiate to predict. EUPE stays fairly expressive on this metric even while compressing variance much more aggressively elsewhere.
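The deck does not spell out which entropy latent-inspector computes here. One common realization, sketched purely to illustrate the scale, is the Shannon entropy of how patches distribute over discrete groups: a uniform spread over 16 groups gives ln 16 ≈ 2.77, close to the values in the table.

```rust
/// Illustrative patch-differentiation measure (the deck does not give the
/// tool's exact definition): Shannon entropy of patch counts per group.
/// All patches in one group -> 0; a uniform spread maximizes it.
fn assignment_entropy(counts: &[usize]) -> f64 {
    let total: f64 = counts.iter().sum::<usize>() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.ln()
        })
        .sum()
}

fn main() {
    println!("{:.2}", assignment_entropy(&[256]));    // collapsed: 0.00
    println!("{:.2}", assignment_entropy(&[16; 16])); // uniform, 16 groups: 2.77
}
```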
Act VII
Do patches point in diverse directions — or all the same way?
Picture a room of 256 compasses.
1.0 = every compass points a different direction. 0.0 = they all point north.
EUPE sits low on this scale, but not near zero.
EUPE still uses fewer directions than the SSL-only models, but the corrected export shows a compressed representation, not a degenerate one.
That matches the surviving qualitative story: sharper, more top-heavy features and much stronger local agreement.
Top-10 variance: 87.0%
still by far the most top-heavy model
Components @ 90%: 13
vs 31, 22, and 29 for the others
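As an illustration only (latent-inspector's exact formula is not stated in this deck), the compass picture can be realized as one minus the mean pairwise cosine similarity between patch vectors: identical directions score 0.0, mutually orthogonal ones score 1.0.

```rust
/// Cosine similarity between two vectors of equal length.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Illustrative compass metric: 1 minus the mean pairwise cosine similarity.
fn directional_diversity(patches: &[Vec<f64>]) -> f64 {
    let n = patches.len();
    let (mut sum, mut pairs) = (0.0, 0.0);
    for i in 0..n {
        for j in (i + 1)..n {
            sum += cosine(&patches[i], &patches[j]);
            pairs += 1.0;
        }
    }
    1.0 - sum / pairs
}

fn main() {
    let aligned = vec![vec![1.0, 0.0], vec![2.0, 0.0], vec![3.0, 0.0]];
    let spread = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    println!("{:.1}", directional_diversity(&aligned)); // all point north: 0.0
    println!("{:.1}", directional_diversity(&spread));  // orthogonal: 1.0
}
```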
Act VIII
CKA — Centered Kernel Alignment
Do two models organize their representations in similar geometric structures?
Like two restaurant critics: do they agree on which restaurants are similar to each other?
1.000 = identical geometry · 0.000 = completely unrelated
| | DINOv2 | I-JEPA | V-JEPA 2 | EUPE |
|---|---|---|---|---|
| DINOv2 | 1.000 | 0.329 | 0.495 | 0.150 |
| I-JEPA | 0.329 | 1.000 | 0.381 | 0.115 |
| V-JEPA 2 | 0.495 | 0.381 | 1.000 | 0.103 |
| EUPE | 0.150 | 0.115 | 0.103 | 1.000 |
The corrected image path pulls V-JEPA 2 much closer to DINOv2 and I-JEPA. EUPE is still the weakest match to the others, but now clearly as a coherent compressed outlier rather than an export artifact.
1. DINOv2 ↔ V-JEPA 2 = 0.495 — highest pair
The corrected 16-frame image path reveals that V-JEPA 2 is much closer to DINOv2 on still images than the retired 2-frame surrogate implied.
2. I-JEPA ↔ V-JEPA 2 = 0.381
V-JEPA 2 still keeps a video-shaped bias, but on images it now sits much closer to the two SSL image encoders than to the old surrogate geometry.
3. EUPE stays weakest against everyone: 0.150 / 0.115 / 0.103
The stronger surviving EUPE signal is compression, not total disagreement. The gap is real; the earlier near-zero magnitude was artifact-driven.
The 86M student does not directly distill from multiple teachers at once.
It distills from a merged 1.9B proxy teacher, and the paper explicitly compares against the direct multi-teacher baseline.
That makes the corrected CKA numbers easier to read: EUPE reorganizes the geometry substantially, but it remains coherent.
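For reference, linear CKA (Kornblith et al., 2019) fits in a few lines of dependency-free Rust; whether the report uses the linear or a kernelized variant is not stated here. Linear CKA is invariant to rotation and isotropic scaling of either representation, which is what makes it a geometry comparison rather than a coordinate comparison:

```rust
/// Subtract each feature's mean (rows = patches, columns = features).
fn center(m: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (n, d) = (m.len(), m[0].len());
    let mut out = m.to_vec();
    for j in 0..d {
        let mean: f64 = m.iter().map(|r| r[j]).sum::<f64>() / n as f64;
        for r in out.iter_mut() {
            r[j] -= mean;
        }
    }
    out
}

/// Squared Frobenius norm of A^T B for matrices with the same row count.
fn cross_frob_sq(a: &[Vec<f64>], b: &[Vec<f64>]) -> f64 {
    let (da, db) = (a[0].len(), b[0].len());
    let mut sum = 0.0;
    for i in 0..da {
        for j in 0..db {
            let dot: f64 = a.iter().zip(b).map(|(ra, rb)| ra[i] * rb[j]).sum();
            sum += dot * dot;
        }
    }
    sum
}

/// Linear CKA = ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F).
fn linear_cka(x: &[Vec<f64>], y: &[Vec<f64>]) -> f64 {
    let (xc, yc) = (center(x), center(y));
    cross_frob_sq(&yc, &xc)
        / (cross_frob_sq(&xc, &xc).sqrt() * cross_frob_sq(&yc, &yc).sqrt())
}

fn main() {
    let x = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![-1.0, 0.5]];
    // A rotated, scaled copy of x: linear CKA ignores this, so the score is 1.
    let y: Vec<Vec<f64>> = x.iter().map(|r| vec![2.0 * r[1], -2.0 * r[0]]).collect();
    println!("{:.3}", linear_cka(&x, &y)); // 1.000
}
```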
For each patch, find its 10 nearest neighbors in each model. What fraction do they share?
| | DINOv2 | I-JEPA | V-JEPA 2 | EUPE |
|---|---|---|---|---|
| DINOv2 | 1.000 | 0.278 | 0.366 | 0.168 |
| I-JEPA | 0.278 | 1.000 | 0.311 | 0.122 |
| V-JEPA 2 | 0.366 | 0.311 | 1.000 | 0.226 |
| EUPE | 0.168 | 0.122 | 0.226 | 1.000 |
36.6% DINOv2 ↔ V-JEPA 2 — highest
12.2% I-JEPA ↔ EUPE — lowest
The corrected adapter changes the local story as well: V-JEPA 2 now shares many more neighborhoods with DINOv2 and I-JEPA than the 2-frame surrogate suggested.
Note: this is patch-neighborhood overlap on a single image, not the paper's ImageNet k-NN classification metric.
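The neighborhood-overlap computation can be sketched as follows. Euclidean distance is an assumption here; the tool may rank neighbors by cosine similarity instead.

```rust
/// Squared Euclidean distance between two patch vectors.
fn dist2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Indices of the k nearest neighbors of point i (excluding i itself).
fn knn(points: &[Vec<f64>], i: usize, k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..points.len()).filter(|&j| j != i).collect();
    idx.sort_by(|&a, &b| {
        dist2(&points[i], &points[a])
            .partial_cmp(&dist2(&points[i], &points[b]))
            .unwrap()
    });
    idx.truncate(k);
    idx
}

/// Mean fraction of k-nearest neighbors that the two models agree on,
/// per patch: |A ∩ B| / k, averaged over all patches.
fn neighborhood_overlap(pa: &[Vec<f64>], pb: &[Vec<f64>], k: usize) -> f64 {
    let n = pa.len();
    let mut total = 0.0;
    for i in 0..n {
        let (na, nb) = (knn(pa, i, k), knn(pb, i, k));
        let shared = na.iter().filter(|j| nb.contains(j)).count();
        total += shared as f64 / k as f64;
    }
    total / n as f64
}

fn main() {
    // Toy example: identical point sets share every neighborhood.
    let a = vec![vec![0.0], vec![1.0], vec![2.0], vec![5.0]];
    println!("{:.3}", neighborhood_overlap(&a, &a, 2)); // 1.000
}
```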
Act IX
latent-inspector inspect elephant.jpg --model dinov2-vit-l14
Full diagnostics: PCA variance spectrum, patch norm distributions,
CLS token analysis, all metrics in one view.
latent-inspector tui elephant.jpg \
-m dinov2-vit-l14,ijepa-vit-h14,vjepa2-vitl-img16-256,eupe-vit-b16
Dashboard · Inspector · Compare · Spectrum
All interactive. All in the terminal.
latent-inspector validate --model dinov2-vit-l14 --model ijepa-vit-h14 \
--model vjepa2-vitl-img16-256 --model eupe-vit-b16
Preprocessing contracts. Golden references. Zero drift.
Not vibes — verifiable measurements.
Act X
1 Training objective — the dominant force
2 Architecture — the container that constrains geometry
3 Modality — image vs video matters more than paradigm
4 Model size — the compression budget
If you're building a system that needs to understand the physical world — a robot, an autonomous vehicle, a world model — the question isn't just "which model gets the best accuracy."
What kind of perception does this model have?
OPEN SOURCE · RUST · RUNS IN SECONDS
cargo install latent-inspector
Sample compare report · Reports index
github.com/AbdelStark/latent-inspector