JEPA
Joint Embedding Predictive Architecture
Papers
- 2022: A Path Towards Autonomous Machine Intelligence (papers/A Path Towards Autonomous Machine Intelligence.pdf)
- 2024: Revisiting Feature Prediction for Learning Visual Representations from Video (how the V-JEPA model is trained)
- 2025: Intuitive physics understanding emerges from self-supervised pretraining on natural videos: https://arxiv.org/abs/2502.11831
1. Intuitive Physics Understanding from Videos
Intuitive physics understanding emerges from self-supervised pretraining on natural videos (arXiv:2502.11831)
Previous work on physics understanding:
- structured models: use hand-coded abstract representations
- pixel-based generative models: reconstruct future sensory input (e.g. the next frames) from past frames
JEPA is a new class of model that mixes both:
- prediction is done in an abstract representation space
- but unlike in structured models, that space is learned rather than hand-coded (see the minimal sketch after this list)
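A minimal sketch of the JEPA idea in PyTorch, assuming a toy setup: the module names, dimensions, and dummy inputs below are illustrative assumptions, not the paper's architecture. The point it shows is that the loss lives in the learned representation space, never in pixel space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJEPA(nn.Module):
    """Toy JEPA: predict representations of hidden content from visible context."""

    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        # Encodes the visible (context) part of the input.
        self.context_encoder = nn.Sequential(
            nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Encodes the masked (target) part; in practice this is typically an
        # exponential moving average of the context encoder (omitted here).
        self.target_encoder = nn.Sequential(
            nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Predicts target representations from context representations.
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, context_patches, target_patches):
        z_context = self.context_encoder(context_patches)
        with torch.no_grad():  # no gradients flow into the target encoder
            z_target = self.target_encoder(target_patches)
        z_pred = self.predictor(z_context)
        # The loss is computed between embeddings, not between pixels.
        return F.mse_loss(z_pred, z_target)

model = ToyJEPA()
context = torch.randn(8, 768)  # stand-in for flattened visible patches
target = torch.randn(8, 768)   # stand-in for flattened masked patches
loss = model(context, target)
loss.backward()
```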
Key points:
- Challenges the Core Knowledge Hypothesis (the view that intuitive physics rests on innate, hard-wired knowledge systems)
- Achieves 92% and 68% zero-shot accuracy on two benchmarks that classify videos as physically plausible or implausible, while pixel-based methods and multimodal large language models perform near chance
- Trains a V-JEPA model on video by masking spatiotemporal blocks and training the model to predict the representations of those blocks (similar in spirit to Predictive Coding)
- Evaluates with the Violation-of-Expectation framework, i.e. by measuring the model's surprise: higher surprise means the video does not conform to the model's prior physics understanding (see the sketch below)
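A sketch of how Violation-of-Expectation scoring could look on top of the toy JEPA above; the `surprise` helper and the pairwise protocol are assumptions for illustration, not the paper's exact pipeline. Surprise is the prediction error in representation space: physically implausible events should be harder to predict.

```python
import torch
import torch.nn.functional as F

def surprise(model, context_patches, target_patches):
    """Per-example prediction error in representation space (higher = more surprising)."""
    with torch.no_grad():
        z_pred = model.predictor(model.context_encoder(context_patches))
        z_target = model.target_encoder(target_patches)
    # Mean squared error per example (no averaging over the batch).
    return F.mse_loss(z_pred, z_target, reduction="none").mean(dim=-1)

def pick_implausible(model, video_a, video_b):
    """Given (context, target) tensors for two versions of a scene, flag the
    higher-surprise one as physically implausible (pairwise VoE protocol)."""
    s_a = surprise(model, *video_a).mean()
    s_b = surprise(model, *video_b).mean()
    return "a" if s_a > s_b else "b"

# Usage with ToyJEPA from the sketch above (dummy tensors as stand-ins):
model = ToyJEPA()
video_a = (torch.randn(8, 768), torch.randn(8, 768))
video_b = (torch.randn(8, 768), torch.randn(8, 768))
print(pick_implausible(model, video_a, video_b))
```

In VoE benchmarks the decision is typically relative, comparing paired plausible/implausible versions of the same scene, rather than thresholding a single video's surprise; the pairwise helper above reflects that.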