EUPE Notice

The earlier EUPE ONNX export used in this deck was broken and produced misleading geometry claims.

The corrected export and refreshed compare artifacts now live in demo/reports/eupe-vs-ssl-reference.html and demo/reports/eupe-compare.json.

Open sample compare report · Browse reports

The deck below has been rewritten around the corrected benchmark and revised interpretation.

How AI Models
See the World

A deep dive into self-supervised vision model representations

latent-inspector | Rust + ONNX Runtime

github.com/AbdelStark/latent-inspector

Act I

Same Image.
Four Models.
Four Realities.

The Input

African elephant on savanna

One photograph. 224 x 224 pixels. Three color channels. Every model sees the same pixels.

What each model sees

Top 3 PCA components mapped to RGB

DINOv2 PCA
DINOv2 — clean object segmentation
I-JEPA PCA
I-JEPA — fine-grained spatial detail
V-JEPA 2 PCA
V-JEPA 2 — strong spatial coherence from the corrected 16-frame image path
EUPE PCA
EUPE — compact, sharper grouping

Act II

Self-Supervised Learning

Learning to see without being told what to look at

The bottleneck

Supervised

Human labels for every image

ImageNet: 14M images, years of work

Millions of dollars

Self-Supervised

No labels needed

The internet: billions of images

Zero annotation cost

The trick: invent a task that requires no labels
but forces the model to understand the image's structure.

Different questions, different understanding

Self-distillation: "Two views of the same image. Produce the same representation for both."

Latent prediction: "I masked part of the image. Predict the representation of the hidden part."

Video prediction: "Predict the representation of the next frame."

Proxy distillation: "First build one large proxy teacher from multiple experts, then compress that proxy into a small generalist student."

Each question creates a different learning pressure.
That pressure sculpts the geometry of the representation.

The Cast

Four Models

DINOv2 ViT-L/14 · 304M · 1024-dim

Self-distillation. Student matches a slowly-evolving teacher across augmented views.

The model learns that objects are the things that stay stable when everything else changes.

Meta FAIR · Oquab et al., 2023

I-JEPA ViT-H/14 · 632M · 1280-dim

Masks large image regions. Predicts the representation of missing patches, not pixels.

Predicts meaning, not appearance. Every patch must encode unique spatial context.

Meta FAIR · Assran et al., 2023 · Yann LeCun's JEPA

V-JEPA 2 ViT-L/16 · 304M · 1024-dim

Video prediction in latent space. Predicts future frames, not pixels.

Even on a static photo, it carries an implicit prior about motion and time.

Meta FAIR · Bardes et al., 2025

EUPE ViT-B/16 · 86M · 768-dim

Proxy-distilled from a 1.9B universal teacher that aggregates multiple specialist teachers.

Compact generalist. Lower-rank, more top-heavy, and more locally coherent than the SSL-only models.

Meta FAIR · Zhu et al., 2026

Same image. Different training pressures.

The architectures and scales differ too, but the central point holds: training pressure reshapes representation geometry.

Let's measure how that single choice
reshapes the entire geometry of the representation.

Act III

The Instrument

cargo install latent-inspector

Rust single binary, no Python env

ONNX Runtime real inference, verified models

Validated SHA-256 checksums, golden references

compare cross-model metrics + matrices

inspect single-model deep diagnostics

tui interactive terminal dashboard

validate model integrity verification

switch to terminal: latent-inspector models --verbose

Act IV

What is a Representation?

From pixels to vectors

Image 224 x 224 px, 3 channels

  ↓  split into 14 x 14 pixel patches

256 patches

  ↓  linear projection to 1024 dimensions

256 vectors x 1024 numbers

  ↓  24 Transformer layers (self-attention + FFN)

256 refined vectors = the representation

262,144 floating-point numbers. That's what we analyze.
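The arithmetic above can be checked in a few lines (a sketch using the slide's numbers, not the tool's code):

```python
# Patch-embedding arithmetic for a ViT-L/14-style encoder (toy sketch).
image_size = 224        # input resolution, pixels per side
patch_size = 14         # each patch covers 14 x 14 pixels
embed_dim = 1024        # width of each patch vector

patches_per_side = image_size // patch_size   # 16
num_patches = patches_per_side ** 2           # 256 patch tokens
total_floats = num_patches * embed_dim        # the numbers we analyze

print(patches_per_side, num_patches, total_floats)  # 16 256 262144
```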

PCA — Principal Component Analysis

1024 dimensions is too many to visualize. PCA finds the directions of maximum variation.

Map the top 3 directions to Red, Green, Blue channels.

Same-colored regions = the model considers those patches similar.

Colors are relative to each model. You cannot compare "red" across models.
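A minimal sketch of the final mapping step, assuming the top-3 per-patch PCA scores are already computed (toy values; `scores_to_rgb` is a hypothetical helper, not part of latent-inspector). Because each channel is normalized per model, colors are only meaningful within one model's image:

```python
# Map per-patch scores on the top 3 PCA components to RGB (hypothetical sketch).
def scores_to_rgb(scores):
    """Min-max normalize each component column to [0, 255], per model."""
    channels = list(zip(*scores))             # one tuple per PCA component
    rgb_cols = []
    for col in channels:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0               # guard against a constant column
        rgb_cols.append([round(255 * (v - lo) / span) for v in col])
    return list(zip(*rgb_cols))               # back to one (R, G, B) per patch

# Three toy patches, each with scores on the top 3 components:
toy = [(-1.0, 0.5, 2.0), (0.0, 0.5, 0.0), (1.0, -0.5, -2.0)]
print(scores_to_rgb(toy))
```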

Act V

How Each Model Sees

DINOv2 PCA

DINOv2

Emergent Segmentation

Sharp boundaries. The elephant body clusters in one color. Background in another.

Self-distillation forces consistency across augmented views. The most consistent thing across crops, rotations, and color shifts is the object itself.

I-JEPA PCA

I-JEPA

Fine-Grained Detail

More colors. Trunk, legs, ears each distinct. Background has spatial structure.

The prediction objective forces each patch to be unique. If adjacent patches had identical representations, the model couldn't predict which one is missing.

V-JEPA 2 PCA

V-JEPA 2

Spatiotemporal Coherence

Much cleaner local continuity once the still image is adapted through the 16-frame evaluation path.

The corrected image wrapper keeps the video prior, but it now looks like a coherent image representation rather than a distorted 2-frame surrogate.

EUPE PCA

EUPE

Compact Compression

Sharper grouping and much stronger local agreement than the other three models.

The corrected export still shows a compressed representation, but it remains clearly image-dependent and structurally meaningful.

Let's quantify exactly how different.

Act VI

Measuring Representations

switch to terminal: latent-inspector compare

Effective Rank

How many dimensions the model actually uses. Like a 1024-channel mixing board — how many channels carry signal?

Model      Effective Rank   Of (total dims)
DINOv2     60               1024
I-JEPA     44               1280
V-JEPA 2   51               1024
EUPE       22               768

DINOv2 remains the broadest spread; EUPE is the most concentrated of the four.
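One common way to compute effective rank, a sketch hedged because the tool's exact formula may differ, is the exponential of the Shannon entropy of the normalized singular values:

```python
import math

def effective_rank(singular_values):
    """exp(entropy) of the normalized spectrum: how many channels carry signal."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# A 4-channel "mixing board" where one channel dominates uses far fewer than 4:
print(effective_rank([10.0, 1.0, 0.5, 0.1]))   # well below 4
print(effective_rank([1.0, 1.0, 1.0, 1.0]))    # exactly 4.0 when all equal
```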

Patch Entropy

How differentiated are the patches? High = every patch says something unique.

Model      Entropy
DINOv2     2.52
I-JEPA     2.89   every patch is unique
V-JEPA 2   2.89   high variation once the image path is corrected
EUPE       2.83   compact, still differentiated

I-JEPA must differentiate to predict. EUPE stays fairly expressive on this metric even while compressing variance much more aggressively elsewhere.
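A hedged sketch of one plausible patch-entropy formulation: Shannon entropy of a normalized distribution over patches (the tool's exact weighting may differ). Uniform weight across patches maximizes it; one dominant patch collapses it:

```python
import math

def shannon_entropy(weights):
    """Entropy of a distribution over patches; high = patches are differentiated."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

uniform = [1.0] * 20                 # every patch equally distinct
peaked  = [19.0] + [1.0 / 19] * 19   # one patch dominates

print(shannon_entropy(uniform))      # ln(20), the maximum for 20 patches
print(shannon_entropy(peaked) < shannon_entropy(uniform))  # True
```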

Act VII

Isotropy

Do patches point in diverse directions — or all the same way?

Picture a room of 256 compasses.
1.0 = every compass points a different direction. 0.0 = they all point north.
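The compass picture can be sketched with one simple isotropy proxy: 1 minus the mean absolute cosine similarity across vector pairs. This is an assumption for illustration; the tool's definition may differ (e.g. a partition-function or eigenvalue-based score):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def isotropy_proxy(vectors):
    """1.0 when directions are maximally diverse, 0.0 when all aligned."""
    sims = [abs(cosine(vectors[i], vectors[j]))
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return 1.0 - sum(sims) / len(sims)

aligned = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]    # all compasses point north
spread  = [(1.0, 0.0), (0.0, 1.0), (-1.0, 1.0)]   # diverse directions

print(isotropy_proxy(aligned))                           # 0.0, fully anisotropic
print(isotropy_proxy(spread) > isotropy_proxy(aligned))  # True
```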

0.796  DINOv2
0.788  I-JEPA
0.417  V-JEPA 2
0.375  EUPE

V-JEPA 2 and EUPE score low, but not near zero.

EUPE still uses fewer directions than the SSL-only models, but the corrected export shows a compressed representation, not a degenerate one.

That matches the surviving qualitative story: sharper, more top-heavy features and much stronger local agreement.

Top-10 variance: 87.0%

still by far the most top-heavy model

Components @ 90%: 13

vs 31, 22, and 29 for the others

Act VIII

Cross-Model Similarity

CKA — Centered Kernel Alignment

What CKA measures

Do two models organize their representations in similar geometric structures?

Like two restaurant critics: do they agree on which restaurants are similar to each other?

1.000 = identical geometry     0.000 = completely unrelated
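A sketch of linear CKA, the standard form from Kornblith et al., 2019, on toy feature matrices (the tool may use a kernel variant; `linear_cka` here is illustrative, not latent-inspector's implementation). Note that CKA is invariant to isotropic rescaling, which is exactly why it compares geometry rather than raw values:

```python
import math

def center(mat):
    """Subtract each column's mean (rows = samples, columns = features)."""
    cols = list(zip(*mat))
    means = [sum(c) / len(c) for c in cols]
    return [[v - m for v, m in zip(row, means)] for row in mat]

def matmul_t(a, b):
    """Return a^T @ b for row-major matrices a (n x p) and b (n x q)."""
    return [[sum(x * y for x, y in zip(c1, c2)) for c2 in zip(*b)]
            for c1 in zip(*a)]

def fro(mat):
    return math.sqrt(sum(v * v for row in mat for v in row))

def linear_cka(x, y):
    x, y = center(x), center(y)
    return fro(matmul_t(x, y)) ** 2 / (fro(matmul_t(x, x)) * fro(matmul_t(y, y)))

a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
b = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [0.0, 0.0]]  # a rescaled: same geometry

print(round(linear_cka(a, a), 3))   # 1.0, identical geometry
print(round(linear_cka(a, b), 3))   # 1.0, CKA ignores isotropic scaling
```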

The CKA Matrix

           DINOv2   I-JEPA   V-JEPA 2   EUPE
DINOv2      1.000    0.329      0.495   0.150
I-JEPA      0.329    1.000      0.381   0.115
V-JEPA 2    0.495    0.381      1.000   0.103
EUPE        0.150    0.115      0.103   1.000

The corrected image path pulls V-JEPA 2 much closer to DINOv2 and I-JEPA. EUPE is still the weakest match to the others, but now clearly as a coherent compressed outlier rather than an export artifact.

Three findings

1. DINOv2 ↔ V-JEPA 2 = 0.495 — highest pair

The corrected 16-frame image path reveals that V-JEPA 2 is much closer to DINOv2 on still images than the retired 2-frame surrogate implied.

2. I-JEPA ↔ V-JEPA 2 = 0.381

V-JEPA 2 still keeps a video-shaped bias, but on images it now sits much closer to the two SSL image encoders than to the old surrogate geometry.

3. EUPE stays weakest against everyone: 0.150 / 0.115 / 0.103

The stronger surviving EUPE signal is compression, not total disagreement. The gap is real; the earlier near-zero magnitude was artifact-driven.

The Actual Distillation Story

The 86M student does not directly distill from multiple teachers at once.

It distills from a merged 1.9B proxy teacher, and the paper explicitly compares against the direct multi-teacher baseline.

That makes the corrected CKA numbers easier to read: EUPE reorganizes the geometry substantially, but it remains coherent.

k-NN Overlap — Local Neighborhood Agreement

For each patch, find its 10 nearest neighbors in each model. What fraction do they share?
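The procedure above can be sketched on toy embeddings (illustrative only; k=2 instead of the deck's k=10, and `knn_overlap` is a hypothetical helper, not the tool's code):

```python
def knn(embeddings, idx, k):
    """Indices of the k nearest neighbors of patch idx (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    others = [j for j in range(len(embeddings)) if j != idx]
    others.sort(key=lambda j: dist2(embeddings[idx], embeddings[j]))
    return set(others[:k])

def knn_overlap(emb_a, emb_b, k=2):
    """Fraction of shared neighbors per patch, averaged over all patches."""
    n = len(emb_a)
    shared = sum(len(knn(emb_a, i, k) & knn(emb_b, i, k)) for i in range(n))
    return shared / (n * k)

model_a = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
model_b = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]  # identical layout

print(knn_overlap(model_a, model_b))  # 1.0, same neighborhoods everywhere
```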

           DINOv2   I-JEPA   V-JEPA 2   EUPE
DINOv2      1.000    0.278      0.366   0.168
I-JEPA      0.278    1.000      0.311   0.122
V-JEPA 2    0.366    0.311      1.000   0.226
EUPE        0.168    0.122      0.226   1.000

36.6% DINOv2 ↔ V-JEPA 2 — highest

12.2% I-JEPA ↔ EUPE — lowest

The corrected adapter changes the local story as well: V-JEPA 2 now shares many more neighborhoods with DINOv2 and I-JEPA than the 2-frame surrogate suggested. This is patch-neighborhood overlap on one image, not the paper's ImageNet k-NN classification metric.

Act IX

The Toolkit

switch to terminal for live demos

Single Model Deep-Dive

latent-inspector inspect elephant.jpg --model dinov2-vit-l14

Full diagnostics: PCA variance spectrum, patch norm distributions,
CLS token analysis, all metrics in one view.

Interactive Terminal UI

latent-inspector tui elephant.jpg \
  -m dinov2-vit-l14,ijepa-vit-h14,vjepa2-vitl-img16-256,eupe-vit-b16

Dashboard · Inspector · Compare · Spectrum
All interactive. All in the terminal.

Validation Pipeline

latent-inspector validate --model dinov2-vit-l14 --model ijepa-vit-h14 \
  --model vjepa2-vitl-img16-256 --model eupe-vit-b16
DINOv2 · 73 signals | I-JEPA · 45 signals | V-JEPA 2 · 45 signals | EUPE · 73 signals

Preprocessing contracts. Golden references. Zero drift.
Not vibes — verifiable measurements.

Act X

What Shapes Perception?

The hierarchy of forces

1  Training objective — the dominant force

2  Architecture — the container that constrains geometry

3  Modality — image vs video matters more than paradigm

4  Model size — the compression budget

Different World Models.
Different Ways of Seeing.

If you're building a system that needs to understand the physical world — a robot, an autonomous vehicle, a world model — the question isn't just "which model gets the best accuracy."

What kind of perception does this model have?

OPEN SOURCE · RUST · RUNS IN SECONDS

latent-inspector

cargo install latent-inspector
DINOv2 · I-JEPA · V-JEPA 2 · EUPE (MAE · CLIP · SigLIP · DINOv3 coming)

Sample compare report · Reports index

github.com/AbdelStark/latent-inspector