WORLD MODELS

Computing the Uncomputable

A visual guide based on the essay by Packy McCormick & Pim De Witte


Have you ever had a dream where you simply watched? That's a video model. Have you ever had a lucid dream where you shaped the story? That's a World Model.

VIDEO MODEL (passive): You can only watch
WORLD MODEL (interactive): You're inside the dream
That a_t, the action at time t, is the magic.
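
The difference can be sketched in a few lines, assuming a toy 2D grid position in place of pixels. The names `VideoModel`, `WorldModel`, and `MOVES`, and the fixed drift to the right, are hypothetical and exist only for illustration: the video model's next frame depends on the current frame alone, while the world model's next frame also depends on the action a_t.

```python
from typing import Dict, Sequence, Tuple

Frame = Tuple[int, int]  # toy stand-in for an image: the viewer's (x, y) position

# Hypothetical action set, mirroring four arrow keys.
MOVES: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, -1),
    "down": (0, 1),
}


class VideoModel:
    """Predicts the next frame from the current frame alone: next = f(frame)."""

    def step(self, frame: Frame) -> Frame:
        x, y = frame
        return (x + 1, y)  # the story always drifts right; the viewer has no say


class WorldModel:
    """Predicts the next frame from the frame and an action: next = f(frame, a_t)."""

    def step(self, frame: Frame, action: str) -> Frame:
        dx, dy = MOVES[action]
        x, y = frame
        return (x + dx, y + dy)


def watch(model: VideoModel, start: Frame, steps: int) -> Frame:
    """Roll the video model forward; nothing the viewer does changes the result."""
    frame = start
    for _ in range(steps):
        frame = model.step(frame)
    return frame


def play(model: WorldModel, start: Frame, actions: Sequence[str]) -> Frame:
    """Roll the world model forward; every action a_t steers the next frame."""
    frame = start
    for action in actions:
        frame = model.step(frame, action)
    return frame
```

Calling `watch` twice from the same start always gives the same ending, whereas `play` ends wherever the chosen actions lead.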

Language Is Not Enough

Clap your hands five times.

Now try to describe it. In words. Every detail.

Language is an incredibly lossy compression of reality.

You just did it in 0.3 seconds. Describing it would take pages. Coding it would take months.

Joseph Knecht left Castalia because symbols alone weren't enough.

The Castalians

λ
π
φ
ω

Perfect symbolic manipulation

The Real World

Messy, embodied, unpredictable

Large Language Models are our Castalians. They can describe clapping, but they cannot clap.


The Loop

OBSERVE → PREDICT → ACT

A batter facing a 100 mph fastball must swing before the visual signal of the ball even reaches their brain.

They don't react to reality — they react to their internal World Model's prediction.
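
A minimal sketch of the observe, predict, act loop under sensing delay, assuming a ball moving at constant velocity. The function names, the constant-velocity internal model, and all numbers are hypothetical and chosen for clean arithmetic, not for realistic baseball physics.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Observation:
    position: float  # distance from the plate as last seen (stale by `latency`)
    velocity: float  # negative when the ball is approaching


def predict(obs: Observation, latency: float) -> float:
    """Internal model: assume constant velocity and roll the stale
    observation forward by the sensing delay."""
    return obs.position + obs.velocity * latency


def act(estimate: float, swing_zone: float) -> str:
    """Swing once the estimated position is inside the swing zone."""
    return "swing" if estimate <= swing_zone else "wait"


def first_swing_step(
    true_positions: Sequence[float],
    velocity: float,
    latency: float,
    dt: float,
    swing_zone: float,
    use_model: bool = True,
) -> Optional[int]:
    """Run the loop and return the time step at which the agent swings."""
    delay_steps = round(latency / dt)
    for step in range(delay_steps, len(true_positions)):
        # Observe: the signal available now left the ball `delay_steps` ago.
        obs = Observation(true_positions[step - delay_steps], velocity)
        # Predict: either trust the internal model or the raw, stale signal.
        estimate = predict(obs, latency) if use_model else obs.position
        # Act on the estimate.
        if act(estimate, swing_zone) == "swing":
            return step
    return None


# Toy trajectory: the ball covers one unit of distance per time step.
positions = [20.0 - i for i in range(21)]
with_model = first_swing_step(positions, -8.0, 0.25, 0.125, 3.0)
reactive = first_swing_step(positions, -8.0, 0.25, 0.125, 3.0, use_model=False)
```

With these toy numbers the predicting agent swings at step 17, exactly when the ball enters the zone, while the purely reactive agent swings at step 19, two steps late.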

A Brief History of World Models

1990-91

What would a World Model do?

The Dream

Schmidhuber · Sutton

Before we had the compute, the data, or the architecture, we had the dream. Schmidhuber proposed learning a model of the world. Sutton proposed unifying learning, planning, and reacting. Both were decades ahead of their time.

2018-19

Can this even work?

Proof of Concept

Ha & Schmidhuber · SimPLe

Ha and Schmidhuber asked: Can agents learn inside their own dreams? Using VAEs and RNNs, they trained agents entirely in imagination. SimPLe learned 26 Atari games from just 2 hours of gameplay.

V (Vision) · M (Memory) · C (Controller)
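
How the three components pass data to one another can be sketched as below. This is a toy illustration, not the published implementation: the averaging encoder stands in for the VAE, the tanh update stands in for the RNN, and the constants and function names are assumptions made for this example.

```python
import math
from typing import List, Sequence


def vision(frame: Sequence[float]) -> List[float]:
    """V: compress a frame into a small latent vector z.
    Toy stand-in for a VAE encoder: the mean of each half of the frame.
    Assumes the frame has at least two values."""
    half = len(frame) // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (len(frame) - half)]


def memory(h: Sequence[float], z: Sequence[float], action: float) -> List[float]:
    """M: update the hidden state from the previous state, the latent, and the action.
    Toy stand-in for an RNN cell: a squashed weighted sum."""
    return [math.tanh(0.5 * h_i + z_i + 0.1 * action) for h_i, z_i in zip(h, z)]


def controller(z: Sequence[float], h: Sequence[float], weights: Sequence[float]) -> float:
    """C: a small linear policy over the concatenation [z, h], squashed into [-1, 1]."""
    features = list(z) + list(h)
    return math.tanh(sum(w * f for w, f in zip(weights, features)))


def run_episode(frames: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """At each step: V encodes the frame, C picks an action from (z, h), M updates h."""
    h = [0.0, 0.0]
    actions = []
    for frame in frames:
        z = vision(frame)
        a = controller(z, h, weights)
        h = memory(h, z, a)
        actions.append(a)
    return actions
```

The point of the split is that only the controller's few weights need to be tuned for the task, while V and M carry the learned picture of the world.
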
2020-22

Can it match humans?

Human Performance

DreamerV2 · MuZero · IRIS · JEPA

DreamerV2 reached human-level on 55 Atari games, trained entirely in imagination on a single GPU. MuZero took the opposite approach — planning in abstract space without generating any frames. The generative vs latent split was born.

Generative vs. Latent
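
The split can be illustrated with a toy sketch in which both styles share the same transition function. Everything here is an assumption made for illustration: the 1D "screen", the integer latent state, and the exhaustive search, which stands in for the far more sophisticated planning that real systems use. The generative rollout renders a frame at every step; the latent planner scores futures without rendering anything.

```python
from itertools import product
from typing import List, Sequence, Tuple

WIDTH = 8  # hypothetical 1D "screen" width


def dynamics(z: int, action: int) -> int:
    """Shared toy transition: the latent state is a position, clamped to the screen."""
    return max(0, min(WIDTH - 1, z + action))


def decode(z: int) -> List[int]:
    """Render a latent state as an observable frame (one lit pixel)."""
    return [1 if i == z else 0 for i in range(WIDTH)]


def generative_rollout(z: int, actions: Sequence[int]) -> List[List[int]]:
    """Generative style: produce a visible frame for every predicted step."""
    frames = []
    for action in actions:
        z = dynamics(z, action)
        frames.append(decode(z))
    return frames


def latent_plan(z: int, goal: int, horizon: int) -> Tuple[int, ...]:
    """Latent style: search over action sequences and score them in latent
    space. decode() is never called, so no frames are generated."""

    def final_distance(actions: Tuple[int, ...]) -> int:
        state = z
        for action in actions:
            state = dynamics(state, action)
        return abs(goal - state)

    # Exhaustive search over {-1, 0, +1}^horizon; fine for a toy, not for real tasks.
    return min(product((-1, 0, 1), repeat=horizon), key=final_distance)
```

The generative path is easy to inspect because its predictions are frames, while the latent path skips the rendering cost entirely.
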
2023-24

Can World Models be truly interactive?

Interactivity

GAIA-1 · DIAMOND · Genie

GAIA-1 scaled to 9 billion parameters on real driving video. DIAMOND used diffusion to produce a fully playable Counter-Strike from 87 hours of footage. Genie learned actions from scratch — no one told it the controls.

2025-NOW

Can models act in the real world?

The Real World

Comma.ai · V-JEPA 2 · SIMA 2 · General Intuition

Comma.ai deployed a World-Model-trained driving policy in production vehicles. V-JEPA 2 controlled robot arms zero-shot. The dream is becoming reality.

Not everything called a 'World Model' is one. Here's how the pieces fit together.

Current Foundation Models

LLMs: Learn language structure. Know gravity exists but have never felt it.
Video Models: Generate beautiful sequences. But you can't act inside them.
3D Reconstruction: Navigate through scenes. But the world doesn't respond.

World Models

Latent World Models: Predict in abstract space. Fast, efficient, invisible. Examples: JEPA · AMI · DreamerV4
Generative World Models: Predict observable futures. You can see the dream. Examples: DIAMOND · GAIA · Genie · General Intuition

Embodied Agents

VLAs: Language model + action head. Pragmatic, battle-tested. Examples: Physical Intelligence · RT-2
Latent Agents: Practice by thinking. Chess grandmasters in their heads. Examples: DreamerV4 · Embo
General Agents: World Model trained, generalist. The end goal. Examples: SIMA 2 · General Intuition

Who's building what, and how much capital is behind them.

[Chart: builders mapped by approach (Latent vs. Generative) and domain (Digital vs. Physical): DeepMind · World Labs · AMI Labs · Wayve · Physical Intelligence · General Intuition · Decart · Comma.ai]

Every approach runs into the same wall: it needs better data.

Data sources, by domain (Digital vs. Physical) and origin (Synthetic vs. Ground Truth):

Game Engine Simulations (Digital, Synthetic): Programmed environments. Perfect control, limited realism.
Lab-Built Environments (Physical, Synthetic): Boston Dynamics-style controlled setups. Real physics, artificial scenarios.
Human Gaming Data (Digital, Ground Truth): Real human responses in digital worlds. Billions of action-labeled clips. Example: General Intuition
Real-World Teleoperation (Physical, Ground Truth): Humans operating robots. Rich but expensive and hard to scale.

Practical → Platonic: VLAs · General Intuition · Latent World Models

The optimal path forward is likely somewhere between where VLAs are today and where AMI might be one day.

Input Modality Transfer: How well does a policy generalize across degrees of freedom?
Sensor Transfer: Does the workload require specialized sensors? General Intuition (GI) is vision-only.
Environment Transfer: Can agents trained in dreams act in reality? The open question.

For millennia, we watched shadows on the wall — describing reality in language, code, and symbols.

Then we learned to dream — building models that imagine what the world could look like.

Then we learned to act inside the dream — taking actions, observing consequences, and training in imagination.

Now the dream is becoming reality. Agents trained in World Models are driving cars, controlling robots, and learning skills no one explicitly taught them.

The real world is — or was — uncomputable.

World Models are changing that.

From Plato's cave to Schmidhuber's dreams, from Ha's car racing game to autonomous vehicles on Tokyo streets — we are learning to build machines that understand our world not through words, but through actions.

Read the full essay on Not Boring →