JEPA
Joint Embedding Predictive Architecture
Papers
- 2022: A Path Towards Autonomous Machine Intelligence (papers/A Path Towards Autonomous Machine Intelligence.pdf)
- 2024: Revisiting Feature Prediction for Learning Visual Representations from Video (how the V-JEPA model is trained)
- 2025: Intuitive physics understanding emerges from self-supervised pretraining on natural videos: https://arxiv.org/abs/2502.11831
1. Intuitive Physics Understanding from Videos
Intuitive physics understanding emerges from self-supervised pretraining on natural videos (arXiv:2502.11831)
Previous work on physics understanding:
- structured models: use hand-coded abstract representations
- pixel-based generative models: reconstruct future sensory input (e.g. the next frames) from past frames
JEPA is a new class of model that mixes both:
- prediction is done in an abstract representation space
- but unlike in structured models, that space is learned rather than hand-coded (see the minimal sketch after this list)
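A minimal sketch of the JEPA idea in PyTorch, assuming a toy setup: the module names, dimensions, and dummy inputs below are illustrative assumptions, not the paper's architecture. The point it shows is that the loss lives in the learned representation space, never in pixel space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJEPA(nn.Module):
    """Toy JEPA: predict representations of hidden content from visible context."""

    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        # Encodes the visible (context) part of the input.
        self.context_encoder = nn.Sequential(
            nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Encodes the masked (target) part; in practice this is typically an
        # exponential moving average of the context encoder (omitted here).
        self.target_encoder = nn.Sequential(
            nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Predicts target representations from context representations.
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, context_patches, target_patches):
        z_context = self.context_encoder(context_patches)
        with torch.no_grad():  # no gradients flow into the target encoder
            z_target = self.target_encoder(target_patches)
        z_pred = self.predictor(z_context)
        # The loss is computed between embeddings, not between pixels.
        return F.mse_loss(z_pred, z_target)

model = ToyJEPA()
context = torch.randn(8, 768)  # stand-in for flattened visible patches
target = torch.randn(8, 768)   # stand-in for flattened masked patches
loss = model(context, target)
loss.backward()
```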
Key points:
- Challenges the Core Knowledge Hypothesis (the view that intuitive physics rests on innate, hard-wired knowledge systems)
- Achieves 92% and 68% zero-shot accuracy on two benchmarks that classify videos as physically plausible or implausible, while pixel-based methods and multimodal large language models perform near chance
- Trains a V-JEPA model on video by masking spatiotemporal blocks and training the model to predict the representations of those blocks (similar in spirit to Predictive Coding)
- Evaluates with the Violation-of-Expectation framework, i.e. by measuring the model's surprise: higher surprise means the video does not conform to the model's prior physics understanding (see the sketch below)
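A sketch of how Violation-of-Expectation scoring could look on top of the toy JEPA above; the `surprise` helper and the pairwise protocol are assumptions for illustration, not the paper's exact pipeline. Surprise is the prediction error in representation space: physically implausible events should be harder to predict.

```python
import torch
import torch.nn.functional as F

def surprise(model, context_patches, target_patches):
    """Per-example prediction error in representation space (higher = more surprising)."""
    with torch.no_grad():
        z_pred = model.predictor(model.context_encoder(context_patches))
        z_target = model.target_encoder(target_patches)
    # Mean squared error per example (no averaging over the batch).
    return F.mse_loss(z_pred, z_target, reduction="none").mean(dim=-1)

def pick_implausible(model, video_a, video_b):
    """Given (context, target) tensors for two versions of a scene, flag the
    higher-surprise one as physically implausible (pairwise VoE protocol)."""
    s_a = surprise(model, *video_a).mean()
    s_b = surprise(model, *video_b).mean()
    return "a" if s_a > s_b else "b"

# Usage with ToyJEPA from the sketch above (dummy tensors as stand-ins):
model = ToyJEPA()
video_a = (torch.randn(8, 768), torch.randn(8, 768))
video_b = (torch.randn(8, 768), torch.randn(8, 768))
print(pick_implausible(model, video_a, video_b))
```

In VoE benchmarks the decision is typically relative, comparing paired plausible/implausible versions of the same scene, rather than thresholding a single video's surprise; the pairwise helper above reflects that.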