This AI Model Can Intuit How the Physical World Works

The original version of this story appeared in Quanta Magazine.

Here’s a test for toddlers: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. Are they surprised if the board seems to pass through the glass, as if it weren’t there? Many 6-month-olds are, and by a year, almost all babies have an intuitive concept of an object’s solidity, learned through observation. Now some artificial intelligence models do too.

Researchers have developed an AI system that learns about the world by watching videos, and when it is shown events that violate its expectations, it displays a measure of “surprise.”

This model, developed by Meta and called the Video Joint Embedding Predictive Architecture (V-JEPA), makes no built-in assumptions about the physics of the world shown in the videos. Even so, it can start to make sense of how the world works.

“Their claims are, a priori, very plausible, and the results are extremely interesting,” said Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems perceive the world.

A Higher Level of Abstraction

As engineers who build self-driving cars know, getting an AI system to reliably understand what it sees can be difficult. Most systems are designed to “understand” videos by either classifying their content (“a person playing tennis”, for example) or identifying the shape of an object – say, a car in front – in what is called “pixel space”. The model essentially treats every pixel in the video as equally important.

But these pixel-space models come with limitations. Imagine trying to model a suburban street scene. If it contains cars, traffic lights, and trees, the model may fixate on irrelevant details, such as the movement of leaves, while missing important ones: the color of a traffic light, or the positions of nearby cars. “When you go to photos or video, you don’t want to work in [pixel] space, because there are a lot of details that you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.
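The difference between comparing scenes pixel by pixel and comparing them in a more abstract representation can be sketched in a few lines. This toy example is illustrative only, not Meta’s actual code: it stands in for a learned encoder with simple patch averaging, and shows how small, irrelevant pixel noise (the “fluttering leaves”) dominates a pixel-space comparison but largely washes out in a coarser representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "video frame": 64x64 grayscale pixel intensities.
frame = rng.random((64, 64))

# The same scene a moment later, with leaves fluttering:
# small random perturbations on every pixel.
frame_with_leaves = frame + 0.05 * rng.standard_normal((64, 64))

# Pixel-space comparison: every pixel counts equally, so the
# irrelevant leaf motion shows up in full.
pixel_error = np.mean((frame - frame_with_leaves) ** 2)

def embed(x):
    # A crude stand-in for a learned encoder: average each 16x16
    # patch, keeping coarse structure and discarding fine detail.
    return x.reshape(4, 16, 4, 16).mean(axis=(1, 3))

# Comparison in the abstract space: the noise averages out, so the
# two frames look nearly identical, as they should.
embedding_error = np.mean((embed(frame) - embed(frame_with_leaves)) ** 2)

print(pixel_error > embedding_error)
```

The point of the sketch is only the contrast: a model judging itself in pixel space is punished for every fluttering leaf, while one judging itself in an abstract space is free to ignore detail that doesn’t matter to the scene.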

Yann LeCun, a New York University computer scientist and director of AI research at Meta, created V-JEPA’s predecessor, JEPA, which works on still images, in 2022.

Photo: École Polytechnique, Université Paris-Saclay