World Models: The Next Substrate for Intelligence After Large Language Models

Sim2Real Research Team · 15 min read

Abstract

For the last few years, AI has mostly meant one thing: large language models (LLMs). They read the internet, predict the next token, and somehow that turns into translation, coding, and even strategy advice.

But if you zoom out a bit, you can see a new center of gravity forming in the research community: world models. These systems don't just autocomplete text; they learn how the world works and then simulate what might happen next. If LLMs are pattern recognizers, world models are consequence engines.

This piece is about what that shift really means — technically, strategically, and for the next generation of AI products.

1. What exactly is a "world model"?

In AI, a world model is an internal model of how the environment evolves over time. Given the current state of the world and an action an agent might take, it tries to predict the next state, the next observation, and often the reward or outcome.

Formally, you can think of it as a model of p(next_state, next_observation, reward | current_state, action). But conceptually it's much simpler:

A world model lets an agent imagine before it acts.

Humans do this constantly. Before crossing a busy street, you mentally simulate: If I walk now, will that car stop? That inner simulation is your (rough, noisy) world model.

In machine learning, the same idea has been explored for decades, especially in model-based reinforcement learning and recurrent neural networks. Schmidhuber's early work in the 1990s already framed planning with RNN-based world models as a central route to general intelligence. The idea went mainstream in 2018 with David Ha and Jürgen Schmidhuber's paper World Models, where an agent learns a compact representation of the environment using a VAE, predicts future latent states with an RNN, and then trains a controller purely inside that learned world.

The key move is always the same:

  1. Compress raw observations into a latent space
  2. Learn dynamics in that latent space (how states evolve under actions)
  3. Plan or learn policies by simulating futures in that latent space
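A minimal sketch of this three-step recipe, written in PyTorch, might look like the code below. Everything here is illustrative: the network shapes, the GRU cell for latent dynamics, and the toy policy head are assumptions made for exposition, not the architecture of any particular paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Step 1: compress raw observations into a compact latent space."""
    def __init__(self, obs_dim=64, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Step 2: predict the next latent state from (latent, action)."""
    def __init__(self, latent_dim=16, action_dim=4):
        super().__init__()
        self.cell = nn.GRUCell(latent_dim + action_dim, latent_dim)

    def forward(self, latent, action):
        return self.cell(torch.cat([latent, action], dim=-1), latent)

def imagine(encoder, dynamics, policy, obs, horizon=10):
    """Step 3: roll out imagined futures entirely in latent space."""
    latent = encoder(obs)
    trajectory = [latent]
    for _ in range(horizon):
        action = policy(latent)
        latent = dynamics(latent, action)
        trajectory.append(latent)
    return torch.stack(trajectory)  # shape: (horizon + 1, batch, latent_dim)

# Toy usage: a random observation and a trivial policy head.
encoder, dynamics = Encoder(), LatentDynamics()
policy = nn.Sequential(nn.Linear(16, 4), nn.Tanh())
futures = imagine(encoder, dynamics, policy, torch.randn(1, 64))
print(futures.shape)  # torch.Size([11, 1, 16])
```

Note that the real environment never appears in `imagine`: once the encoder and dynamics are trained, planning happens entirely inside the learned latent space.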

2. Why world models matter now

LLMs are surprisingly powerful, but they have structural limitations:

  • They mostly see static sequences, not ongoing worlds
  • They don't naturally maintain persistent state over long horizons
  • Their notion of "consequences" is statistical, not grounded in causal dynamics

As a result, you get classic issues: hallucinations, brittle long-term planning, and difficulty transferring to physical systems like robots. A growing number of researchers now argue that scaling text-only LLMs may be hitting diminishing returns as a direct path to AGI.

World models directly attack three bottlenecks:

  1. Sample efficiency — An agent can practice in its own internal simulator instead of expensive real-world interaction.
  2. Generalization beyond the dataset — By learning underlying dynamics, a world model can extrapolate to new situations.
  3. Counterfactual reasoning — You can ask: What if we did X instead of Y? and roll that forward in simulation.
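Item 3 is worth making concrete. Reusing the hypothetical encoder and dynamics from the sketch above, a counterfactual query is simply two rollouts from the same starting latent:

```python
def compare_counterfactuals(dynamics, start_latent, actions_x, actions_y):
    """Roll the same start state forward under two alternative action plans."""
    def rollout(actions):
        latent = start_latent
        for action in actions:
            latent = dynamics(latent, action)
        return latent

    return rollout(actions_x), rollout(actions_y)

# "What if we did X instead of Y?" with two toy three-step action plans.
plan_x = [torch.randn(1, 4) for _ in range(3)]
plan_y = [torch.randn(1, 4) for _ in range(3)]
start = encoder(torch.randn(1, 64))
end_x, end_y = compare_counterfactuals(dynamics, start, plan_x, plan_y)
```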

3. A quick tour of modern world models

3.1 From "World Models" to DreamerV3

Ha & Schmidhuber's World Models (2018) gave a clean, modular recipe: disentangle representation, dynamics, and control, and train agents by imagining future rolls.

In the Dreamer family, Danijar Hafner and collaborators pushed this idea much further. DreamerV3 learns a world model from raw experience and then uses imagined trajectories to train a policy, achieving strong performance across more than 150 diverse control tasks — including learning to mine diamonds in Minecraft from scratch, without human demonstrations.

This is a big conceptual shift: the real environment is just a data source, while the world model becomes the main playground where learning actually happens.
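To make that shift concrete, here is a hedged, heavily simplified sketch of policy learning inside the model, reusing the toy encoder, dynamics, and policy from Section 1. Real DreamerV3 uses an RSSM, actor-critic heads, and a learned value objective; the loss below is a placeholder invented purely for illustration.

```python
def imagination_training_step(encoder, dynamics, policy, optimizer, real_obs):
    """One policy update computed entirely on imagined latent trajectories."""
    latents = imagine(encoder, dynamics, policy, real_obs, horizon=15)
    # Toy objective: prefer latent futures with small norm. A real system
    # would maximize a learned value estimate over imagined returns instead.
    loss = latents.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Real observations only seed the imagination; gradients flow through the
# imagined trajectory back into the policy.
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss = imagination_training_step(encoder, dynamics, policy, optimizer,
                                 torch.randn(1, 64))
```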

3.2 Genie: generative interactive environments

Google DeepMind's Genie line of models takes a complementary angle: train on massive internet video data to build a foundation world model that turns frames into interactive, controllable environments.

Genie 2 and Genie 3 can generate diverse, playable worlds from images, sketches, or text prompts. Users (or agents) can act in those worlds frame-by-frame, with the model predicting the consequences of each action. Genie 3 produces high-resolution interactive environments, supports multi-minute interactions, maintains short-term spatial consistency, and allows "promptable world events." This isn't just game tech — it's a general-purpose training ground for agents.

3.3 World foundation models for physical AI

Companies like NVIDIA are building world foundation models for robotics and "physical AI": large models trained on enormous video and simulation datasets to act as world twins — simulators that mirror and predict the real world.

The NVIDIA Cosmos platform includes large-scale video curation, pre-trained world foundation models, and tools for post-training and adaptation — all packaged for builders of robots, self-driving systems, and other embodied agents.

In this framing, LLMs are foundation models for language, while world FMs are foundation models for dynamics. Together, they form the brain of physical AI.

[Figure: World models architecture, showing LLM and world model integration.]

4. From token prediction to consequence modeling

LLMs: static, symbolic prediction

LLMs see sequences of tokens, predict the next token based on corpus statistics, and capture a huge amount of semantic and cultural structure. They're great for explaining, summarizing, designing plans in abstract terms, and acting as interfaces between humans and complex systems. But their "world" is mostly frozen into text.

World models: dynamic, stateful simulation

World models operate in latent state spaces that evolve over time, encode dynamics (physics, agents, interactions), and are built to simulate futures under different action sequences. They're great for planning and control, training agents in rich simulated environments, and counterfactual reasoning.

One way to think about the future stack:

LLMs become the "cortex" for language and abstract reasoning, while world models become the "simulation substrate" where those plans are tested and optimized.

The most interesting systems will be closed loops where:

  1. An LLM decomposes a high-level goal into strategies
  2. A world model simulates those strategies under different assumptions
  3. The LLM inspects the results, revises the plan, and chooses an execution path
  4. Real-world execution generates new data, improving the world model

That loop is far more powerful than pure text prediction.
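In code, that loop might look like the toy sketch below. ToyLLM and ToyWorldModel are stand-ins invented purely for illustration; a real system would back each method with actual models and real execution traces.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    strategy: str
    score: float

class ToyWorldModel:
    def simulate(self, strategy):
        # Stand-in for a latent rollout; real systems return trajectories.
        return Outcome(strategy, score=hash(strategy) % 100 / 100)

    def update(self, trace):
        pass  # real systems would retrain on execution traces

class ToyLLM:
    def propose_strategies(self, goal, n=3):
        return [f"{goal}: strategy {i}" for i in range(n)]

    def select_plan(self, outcomes):
        return max(outcomes, key=lambda o: o.score).strategy

def plan_and_act(llm, world, goal):
    strategies = llm.propose_strategies(goal)           # 1. decompose the goal
    outcomes = [world.simulate(s) for s in strategies]  # 2. simulate each one
    plan = llm.select_plan(outcomes)                    # 3. inspect and choose
    world.update(trace=plan)                            # 4. feed execution back
    return plan

print(plan_and_act(ToyLLM(), ToyWorldModel(), "restock warehouse"))
```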

5. Design patterns: how world models will show up in products

5.1 "Train in simulation, deploy in reality"

For physical AI — robots, autonomous vehicles, drones — data is expensive and sometimes dangerous to collect. World foundation models let you train policies in massive synthetic environments, stress-test edge cases, and quickly iterate on control logic without touching hardware.

The long-term product pattern: "Sim-first development" for robotics, like "mobile-first" was for apps.

5.2 Digital twins that are actually intelligent

Digital twins today are often glorified dashboards. World models push them toward an internal generative model of your factory, building, or supply chain — capable of simulating outcomes of interventions and accessible via natural language through an LLM front-end.

Imagine saying: "Simulate what happens to our logistics if Shanghai is partially closed for two weeks and fuel prices spike by 30%." The LLM grounds your request; the world model rolls the futures; the LLM explains the trade-offs back to you.
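Under the hood, the LLM's first job might be to compile that sentence into a structured scenario the world model can consume. Here is one hypothetical shape; every field name below is invented for illustration:

```python
# Hypothetical scenario spec an LLM front-end might emit from the request.
scenario = {
    "interventions": [
        {"node": "port/shanghai", "capacity_factor": 0.5, "duration_days": 14},
        {"variable": "fuel_price", "relative_shock": 0.30},
    ],
    "horizon_days": 90,
    "n_rollouts": 1000,  # sample a distribution of futures, not one forecast
    "report": ["delivery_delays", "cost_per_shipment", "stockout_risk"],
}
```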

5.3 Interactive media and user-generated worlds

Genie-style models hint at a new creative stack: a user sketches or describes a world in text, the world model generates an interactive environment, and agents play inside that world in real time. Think of it as "Unity/Unreal, but the level designer is a generative world model."

5.4 Policy and alignment sandboxes

Before deploying powerful agents into high-stakes domains, regulators and companies will want to simulate long-horizon behaviors, test under adversarial scenarios, and explore how incentive structures interact with agent strategies. Rich world models make it possible to build testbeds for AI behavior.

6. Hard problems: why world models are not a magic bullet

6.1 Fidelity vs. tractability

You can't perfectly model the real world. Too simple → brittle policies. Too detailed → computationally intractable. Finding the right abstraction level is both a science and an art.

6.2 Distribution shift and "sim2real" gaps

Agents trained in a world model can fail catastrophically if the real world drifts away from the model, rare events weren't adequately modeled, or feedback loops appear that the simulator never captured. Bridging the sim2real gap remains one of the main bottlenecks for physical AI.

6.3 Safety and misuse

World models that can simulate complex social/physical environments raise new risks: optimization over people (agents finding strategies that manipulate humans) and dual use (realistic simulations helping adversaries test attacks).

6.4 Evaluation is fundamentally harder

Evaluating a world model means asking: "Is this model's long-horizon behavior realistic?" "Does it capture the right causal structure?" "Will policies trained here generalize safely?" These are inherently multi-dimensional, system-level questions.

7. Looking forward: a roadmap for the next decade

7.1 World models become cloud primitives

Just as today you can call text.generate() on an LLM, we'll see APIs like world.simulate(initial_state, policy, horizon) and world.generate(prompt, interactive=True). Cloud providers have strong incentives to host massive world foundation models.
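A toy stub can make that shape concrete. Nothing below is a real service; it is a guess at the ergonomics, with invented names and trivial stand-in dynamics:

```python
class WorldAPI:
    """Invented interface mirroring the hypothetical calls above."""

    def simulate(self, initial_state, policy, horizon):
        """Roll `policy` forward `horizon` steps from `initial_state`."""
        state, trajectory = initial_state, []
        for _ in range(horizon):
            state = state + policy(state)  # toy dynamics stand-in
            trajectory.append(state)
        return trajectory

    def generate(self, prompt, interactive=False):
        """Stand-in for prompting a generative world into existence."""
        return f"<world generated from {prompt!r}, interactive={interactive}>"

world = WorldAPI()
print(world.simulate(initial_state=0, policy=lambda s: 1, horizon=3))
print(world.generate("rainy loading dock at night", interactive=True))
```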

7.2 LLMs become "goal compilers" into world models

LLMs will sit on top of world models, translating human intent into structured objectives that can be simulated and optimized. The intent → simulation → decision loop will be central to next-gen AI products.

7.3 Every serious robot gets a "world twin"

For any robot deployed at scale, it will be standard to have a policy twin (the robot's controller) and a world twin (a world model capturing its operational environment). Most learning will happen in the twin, not on the physical robot.

7.4 Simulated societies for policy and alignment

As world models become richer, we'll see simulated micro-economies to test financial regulations, synthetic populations to explore social policies, and multi-agent environments to test how powerful AI systems behave when incentives vary.

8. How to think about world models if you're building now

  1. Treat the world model as a core system component, not a sidecar. It's the place where your agents actually learn.
  2. Design for the loop, not the pieces. LLM ↔ world model ↔ real environment is one closed system.
  3. Invest in instrumentation and eval early. Debugging a mis-specified world model late in the game is painful.
  4. Expect regulation to arrive. Simulated testing will likely become a regulatory requirement.

9. Closing thought

LLMs showed us that scaling simple objectives can unlock emergent capabilities. World models are the next big bet: scale up the ability to simulate, not just to predict text.

If this works, the frontier of AI moves from talking about the world to rehearsing futures inside a learned world, and then carefully choosing which of those futures we bring into reality. That's not just a new model class. It's a new substrate for intelligence.


Key Takeaways

  • World models are consequence engines: They learn how the world works and simulate what might happen next, enabling agents to "imagine" before acting.
  • LLMs + World Models = Powerful combo: LLMs handle language and abstract reasoning; world models handle dynamics and control.
  • Sim-first development is coming: For robotics and physical AI, training in simulation before deployment will become standard.
  • Key challenges remain: Sim2real gaps, evaluation complexity, and safety concerns are core design constraints.
  • Cloud-native world models: Expect APIs like world.simulate() to become as common as text.generate().
