World Models: Computing the Uncomputable

01

The Dream Distinction

Have you ever had a dream where you simply watched? That's a video model. Have you ever had a lucid dream where you shaped the story? That's a World Model.

VIDEO MODEL (passive)

You can only watch

P(x_{t+1} | x_t)

Predicts what happens next

WORLD MODEL (interactive)

You're inside the dream

P(s_{t+1} | s_t, a_t)

Predicts what happens next given what you do

That a_t — the action at time t — is the magic.
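The difference between the two distributions can be made concrete with a toy example. The sketch below is illustrative only: the five-position line world, the two actions, and the uniform behavior policy are all assumptions introduced here, not anything from the essay. It shows that a video model is what you get when the action is averaged out of a world model.

```python
from collections import Counter

ACTIONS = (-1, +1)   # hypothetical action set: step left or step right
N_STATES = 5         # hypothetical world: positions 0..4 on a line


def world_model(state, action):
    """P(s_{t+1} | s_t, a_t): next-state distribution given the chosen action."""
    nxt = min(max(state + action, 0), N_STATES - 1)  # walls clip the move
    return {nxt: 1.0}


def video_model(state, behavior_policy):
    """P(x_{t+1} | x_t): the action is averaged out, so the viewer cannot choose it."""
    dist = Counter()
    for action, p_action in behavior_policy.items():
        for nxt, p_next in world_model(state, action).items():
            dist[nxt] += p_action * p_next
    return dict(dist)


uniform = {a: 1 / len(ACTIONS) for a in ACTIONS}
passive = video_model(2, uniform)    # probability spread over both neighbours
interactive = world_model(2, +1)     # the chosen action selects one future
```

From position 2, the passive model can only say the next position is 1 or 3 with equal probability, while the interactive model, told that the action is +1, commits to position 3.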

02

Language Is Not Enough

Clap your hands five times.

Now try to describe it. In words. Every detail.

You just did it in 0.3 seconds. Describing it would take pages. Coding it would take months.

Language is an incredibly lossy compression of reality.

Joseph Knecht left Castalia because symbols alone weren't enough.

The Castalians

∫ ∂ ∇ ∑ ♪ ♫ ♩ ♭ ∀ ∃ ∞ ∝ λ π φ ω

Perfect symbolic manipulation

The Real World

Messy, embodied, unpredictable

Large Language Models are our Castalians. They can describe clapping, but they cannot clap.

03

The Loop

A batter facing a 100 mph fastball must swing before the visual signal of the ball even reaches their brain.

They don't react to reality — they react to their internal World Model's prediction.
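A rough back-of-the-envelope calculation shows why prediction is unavoidable here. The sketch below assumes the regulation pitching distance of 60.5 feet, a constant ball speed, and a purely hypothetical perception delay of 0.1 seconds chosen for illustration; none of these numbers come from the essay.

```python
MPH_TO_FTPS = 5280 / 3600   # feet per second in one mile per hour


def flight_time(speed_mph, distance_ft=60.5):
    """Seconds for the ball to cover distance_ft at constant speed."""
    return distance_ft / (speed_mph * MPH_TO_FTPS)


def position_now(observed_ft, speed_mph, delay_s):
    """Estimate where the ball is now, given an observation that is delay_s old.

    observed_ft is the distance already travelled when the stale observation
    was captured; the internal model rolls it forward by the delay.
    """
    return observed_ft + speed_mph * MPH_TO_FTPS * delay_s


total = flight_time(100)                # whole flight, in seconds
drift = position_now(0.0, 100, 0.1)     # feet travelled during the assumed delay
```

Under these assumptions the whole flight lasts about 0.41 seconds, and the ball moves more than 14 feet during the assumed delay, so acting on the last observation alone would mean swinging at a ball that is no longer there.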

04

The Waves

A Brief History of World Models

1990-91

The Dream

What would a World Model do?

Schmidhuber · Sutton

Before we had the compute, the data, or the architecture, we had the dream. Schmidhuber proposed learning a model of the world. Sutton proposed unifying learning, planning, and reacting. Both were decades ahead of their time.

2018-19

Can this even work?

Proof of Concept

Ha & Schmidhuber · SimPLe

Ha and Schmidhuber asked: can agents learn inside their own dreams? Using a VAE and an RNN, they trained agents entirely in imagination. SimPLe then learned 26 Atari games from roughly 2 hours of real-time gameplay per game.

V → M → C (Vision → Memory → Controller)
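The V → M → C pipeline can be sketched as three small components wired together. This is a toy illustration, not the authors' implementation: the encoder here is a hand-written summary rather than a VAE, the memory is a moving average rather than an RNN, and every class and parameter name is hypothetical.

```python
class Vision:
    """V: compress an observation into a small latent vector z."""

    def encode(self, observation):
        # Stand-in for a VAE encoder: keep only the mean and the range.
        mean = sum(observation) / len(observation)
        return [mean, max(observation) - min(observation)]


class Memory:
    """M: keep a hidden state h that summarises the past."""

    def __init__(self, size=2, decay=0.9):
        self.h = [0.0] * size
        self.decay = decay

    def step(self, z, action):
        # Stand-in for an RNN update: blend the old state with (z + action).
        self.h = [self.decay * h + (1 - self.decay) * (zi + action)
                  for h, zi in zip(self.h, z)]
        return self.h


class Controller:
    """C: a small linear policy over the concatenation [z, h]."""

    def __init__(self, weights):
        self.weights = weights

    def act(self, z, h):
        score = sum(w * x for w, x in zip(self.weights, z + h))
        return 1 if score > 0 else -1


def run_episode(observations, controller):
    vision, memory = Vision(), Memory()
    h, actions = memory.h, []
    for obs in observations:
        z = vision.encode(obs)      # V: observation -> latent
        a = controller.act(z, h)    # C: latent + memory -> action
        h = memory.step(z, a)       # M: update memory with what happened
        actions.append(a)
    return actions


actions = run_episode([[1, 1], [-1, -1]], Controller([1, 0, 0, 0]))
```

The point of the split is that the controller stays tiny: it only ever sees the compact latent and memory state, never raw pixels.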

2020-22

Can it match humans?

Human Performance

DreamerV2 · MuZero · IRIS · JEPA

DreamerV2 reached human-level on 55 Atari games, trained entirely in imagination on a single GPU. MuZero took the opposite approach — planning in abstract space without generating any frames. The generative vs latent split was born.
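Planning in abstract space can be illustrated with a minimal sketch. MuZero itself uses Monte Carlo tree search over a learned model; the example below swaps that for an exhaustive search over short action sequences, and its scalar latent, dynamics, reward, and goal value are all hypothetical. What it preserves is the key idea: candidate futures are scored entirely in latent space and no frame is ever decoded.

```python
from itertools import product

ACTIONS = (-1, 0, 1)   # hypothetical discrete action set


def latent_dynamics(z, action):
    """Hypothetical learned transition: next latent from latent and action."""
    return z + 0.5 * action


def latent_reward(z, goal=3.0):
    """Hypothetical learned reward: higher when the latent is near the goal."""
    return -abs(z - goal)


def plan(z0, horizon=5):
    """Return the first action of the best action sequence, scored in latent space."""
    best_return, best_first = float("-inf"), 0
    for seq in product(ACTIONS, repeat=horizon):
        z, total = z0, 0.0
        for action in seq:
            z = latent_dynamics(z, action)   # imagine one step, no pixels
            total += latent_reward(z)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first


first_action = plan(0.0)
```

A generative world model would decode each imagined step back into an observable frame; the latent approach skips that cost, which is why it is fast but invisible.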

2023-24

Can World Models be truly interactive?

Interactivity

GAIA-1 · DIAMOND · Genie

GAIA-1 scaled to 9 billion parameters on real driving video. DIAMOND used diffusion to produce a fully playable Counter-Strike from 87 hours of footage. Genie learned actions from scratch — no one told it the controls.

2025-NOW

Can models act in the real world?

The Real World

Comma.ai · V-JEPA 2 · SIMA 2 · General Intuition

Comma.ai deployed a World-Model-trained driving policy in production vehicles. V-JEPA 2 controlled robot arms zero-shot. The dream is becoming reality.

05

The Taxonomy

Not everything called a 'World Model' is one. Here's how the pieces fit together.

Current Foundation Models

LLMs

Learn language structure. Know gravity exists but have never felt it.

Video Models

Generate beautiful sequences. But you can't act inside them.

3D Reconstruction

Navigate through scenes. But the world doesn't respond.

World Models

Latent World Models

Predict in abstract space. Fast, efficient, invisible.

JEPA · AMI · DreamerV4

Generative World Models

Predict observable futures. You can see the dream.

DIAMOND · GAIA · Genie · General Intuition

Embodied Agents

VLAs

Language model + action head. Pragmatic, battle-tested.

Physical Intelligence · RT-2

Latent Agents

Practice by thinking. Chess grandmasters in their heads.

DreamerV4 · Embo

General Agents

World Model trained, generalist. The end goal.

SIMA 2 · General Intuition

06

The Landscape

Who's building what, and how much capital is behind them.

07

The Data Question

Every approach runs into the same wall: it needs better data.

Two axes: Digital vs. Physical, and Synthetic vs. Ground Truth.

Game Engine Simulations

Programmed environments. Perfect control, limited realism.

Lab-Built Environments

Boston Dynamics-style controlled setups. Real physics, artificial scenarios.

Human Gaming Data

Real human responses in digital worlds. Billions of action-labeled clips.

General Intuition

Real-World Teleoperation

Humans operating robots. Rich but expensive and hard to scale.
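What "action-labeled" means in practice can be shown with a small data-structure sketch. The `Clip` type, its field names, and the string placeholders for frames are hypothetical and chosen only for illustration. The idea is that a recording pairing frames with the inputs pressed between them can be cut into (state, action, next state) triples, which is the shape a model of P(s_{t+1} | s_t, a_t) trains on.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    frames: List[str]    # placeholders for image frames
    actions: List[str]   # the input recorded between consecutive frames


def to_transitions(clip: Clip) -> List[Tuple[str, str, str]]:
    """Cut a clip into (state, action, next_state) training triples."""
    if len(clip.actions) != len(clip.frames) - 1:
        raise ValueError("expected exactly one action between each pair of frames")
    return [
        (clip.frames[i], clip.actions[i], clip.frames[i + 1])
        for i in range(len(clip.actions))
    ]


triples = to_transitions(Clip(["f0", "f1", "f2"], ["left", "jump"]))
```

Video without the `actions` field can still train a video model, but the middle element of each triple is what makes interactive prediction learnable.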

08

The Path Forward

Practical ← VLAs · General Intuition · Latent World Models → Platonic

The optimal path forward is likely somewhere between where VLAs are today and where AMI might be one day.

Input Modality Transfer

How well does a policy generalize across degrees of freedom?

Sensor Transfer

Does the workload require specialized sensors? General Intuition is vision-only.

Environment Transfer

Can agents trained in dreams act in reality? The open question.

09

The Agent Evolves

For millennia, we watched shadows on the wall — describing reality in language, code, and symbols.

Then we learned to dream — building models that imagine what the world could look like.

Then we learned to act inside the dream — taking actions, observing consequences, and training in imagination.

Now the dream is becoming reality. Agents trained in World Models are driving cars, controlling robots, and learning skills no one explicitly taught them.

The real world is — or was — uncomputable.

World Models are changing that.

From Plato's cave to Schmidhuber's dreams, from Ha's car racing game to autonomous vehicles on Tokyo streets — we are learning to build machines that understand our world not through words, but through actions.

Read the full essay on Not Boring →
