2025-04-28

JEPA


Joint Embedding Predictive Architecture

Papers

1. Intuitive Physics understanding from Videos

Paper: Intuitive physics understanding emerges from self-supervised pretraining on natural videos (arXiv)

Previous work on physics understanding falls into two classes:

  • structured models: use hand-coded abstract representations
  • pixel-based generative models: reconstruct future sensory input (e.g. an image) from past input

JEPA is a new class of model that combines both:

  • prediction is done in an abstract representation space
  • but, unlike in structured models, that space is learned rather than hand-coded
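A minimal sketch of why predicting in representation space can differ from predicting pixels. Everything here is a toy assumption: `encode` is a fixed block-averaging function standing in for a learned encoder, and the "frame" is a short list of numbers. It only illustrates that a loss computed on abstract representations can ignore unpredictable fine detail that a pixel-level loss is forced to pay for.

```python
import random

def encode(frame, patch=4):
    """Toy stand-in for a learned encoder: average each block of `patch` values.
    (In JEPA the encoder is learned; this fixed one is only for illustration.)"""
    return [sum(frame[i:i + patch]) / patch for i in range(0, len(frame), patch)]

def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)

# "True" next frame: coarse structure plus unpredictable fine detail.
structure = [float(i // 4) for i in range(16)]
target = [s + random.uniform(-0.5, 0.5) for s in structure]

# A predictor that captures the structure but cannot guess the noise.
prediction = structure

pixel_loss = l1(prediction, target)                    # compares raw values
latent_loss = l1(encode(prediction), encode(target))   # compares representations

print(f"pixel-space loss:  {pixel_loss:.3f}")
print(f"latent-space loss: {latent_loss:.3f}")
```

With block averaging, the latent loss can never exceed the pixel loss, because averaging the noise within a block can only shrink its magnitude. A learned encoder gives no such guarantee; the point is only the qualitative contrast between the two objectives.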

The paper:

  • Challenges the Core Knowledge Hypothesis (the idea that intuitive physics must be innate or hardwired)
  • Reports 92% and 68% zero-shot accuracy on two benchmarks that classify videos as physically plausible or not, while pixel-based methods and multimodal large language models perform close to chance
  • Trains a V-JEPA model on video by masking blocks of the video and training the model to predict those blocks in representation space [like Predictive Coding]
  • Evaluates with the Violation of Expectation framework, i.e. by measuring the model's surprise. Higher surprise means the video does not conform to the model's learned understanding of physics
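The surprise idea can be sketched as follows. This is an illustrative toy, not the paper's protocol: the "representations" are hand-made 2-D vectors, `predict_next` is a hypothetical constant-velocity extrapolator standing in for the learned predictor, and comparing a pair of videos by mean surprise is an assumed way of turning surprise into a plausible/implausible decision.

```python
def predict_next(context):
    """Hypothetical predictor: linearly extrapolate from the last two representations."""
    prev, last = context[-2], context[-1]
    return [2 * l - p for p, l in zip(prev, last)]

def surprise(reps):
    """Mean L1 distance between predicted and observed representations over time."""
    errors = []
    for t in range(2, len(reps)):
        pred = predict_next(reps[:t])
        errors.append(sum(abs(a - b) for a, b in zip(pred, reps[t])) / len(pred))
    return sum(errors) / len(errors)

def pick_implausible(video_a, video_b):
    """Label the video with the higher surprise as the implausible one."""
    return "a" if surprise(video_a) > surprise(video_b) else "b"

# Plausible video: an object moving at constant velocity.
plausible = [[float(t), 0.0] for t in range(8)]

# Implausible video: same motion, but the object jumps back to the start at t=5.
implausible = [list(r) for r in plausible]
implausible[5] = [0.0, 0.0]

print("surprise (plausible):  ", surprise(plausible))
print("surprise (implausible):", surprise(implausible))
print("implausible video is:  ", pick_implausible(plausible, implausible))
```

The constant-velocity video is predicted exactly, so its surprise is zero, while the teleporting object produces large prediction errors around the jump.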


