The Model That Dreams the World

Table of Contents

  1. The Most Overloaded Word in AI

  2. Two Threads Converging

  3. What Are World Models Actually Good For?

  4. The $10 Billion Bet

  5. What's Actually at Stake

1. The Most Overloaded Word in AI

Here's what a world model can do. A robot has never seen a shoelace before. No one has teleoperated it through the untying motion. But the robot reaches down, grasps the lace, and pulls it free. It succeeds because its controller was trained with a model that watched thousands of hours of human hands doing things, learned how objects behave when you pull and twist and push them, and can imagine what will happen before it acts. The robot practiced in its own imagination before touching reality.

That's the promise. A model that understands the physical world well enough to predict what happens next, and to act on those predictions. Not a language model that describes the world in words. Not a video generator that produces footage. A model of how things actually work.

Over $10 billion has been invested in this idea in the past 18 months. Yann LeCun left Meta to build one. Danijar Hafner, whose Dreamer series is the most influential work in model-based RL, left DeepMind to commercialize one. NVIDIA open-sourced an entire stack of them. OpenAI shut down Sora, framed the shutdown as a pivot to "world simulation for robotics," and then the team lead left the company three weeks later.

Most of what gets called a "world model" isn't one. The term now covers video generators, RL dream-machines, abstract representation learners, and action-predicting foundation models. Two separate research traditions recently merged to produce what we now call "video world models." How that happened, and whether the result actually works, is what this post is about.

Why now? Two things happened at the same time. First, interactive video models had existed since 2024 (Genie, GameNGen), but only as narrow prototypes. In 2025, two breakthroughs (AR-DiT and Self Forcing) made it possible to take general-purpose, high-quality video foundation models and make them both interactive and real-time. That turned video world models from research curiosities into potentially useful infrastructure. Second, robotics has always been data-starved, but the hunger got orders of magnitude worse as the industry started training foundation models. Today's best robot foundation models train on around 10,000 hours of teleoperation data. But teleoperation is expensive, slow to collect, and narrow in diversity. World models offer a different path: pretrain on the millions of hours of human video that already exist, then fine-tune on small amounts of robot data.

A reality check is warranted, though. Robotics AI as a whole is earlier than the funding suggests. The most capable deployments today use VLAs (vision-language-action models), not world models. World models have shown strong results in specific settings — DreamDojo's near-perfect policy evaluation, DreamGen's generalization from minimal data — but general-purpose manipulation remains unsolved for everyone, regardless of approach.

2. Two Threads Converging

What we now call "video world models" came from two separate research traditions that developed in parallel for decades, then merged around 2024-2025.

Thread A: Learning to Dream (RL World Models, 1990-2025)

The idea that an agent should build an internal model of its world is older than deep learning. Kenneth Craik argued in 1943 (The Nature of Explanation) that humans carry "small-scale models" of reality in their heads to anticipate events. In 1990, Jürgen Schmidhuber published "Making the World Differentiable," formalizing this for neural networks: an intelligent agent should learn a differentiable model of its environment and use it to plan. The idea went largely dormant for almost three decades.

In 2018, David Ha and Schmidhuber revived it with a paper titled "World Models" and an interactive website, worldmodels.github.io, that let you watch an AI agent dream. The architecture was three modules: a VAE to compress pixels into latent vectors, an MDN-RNN to predict dynamics as probability distributions in that latent space, and a tiny controller trained entirely on imagined rollouts. An agent trained in its own dreams was deployed to reality, and it worked. Car Racing, VizDoom. Proof of concept.
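
To make the three-module split concrete, here is a minimal PyTorch sketch of the shape of that architecture. It is an illustration under assumptions, not the paper's exact configuration: layer sizes are placeholders, the VAE decoder is omitted, and in the original work the controller was trained with evolution strategies on rollouts imagined by M.

```python
import torch
import torch.nn as nn

# Sketch of the 2018 "World Models" pieces: V compresses pixels to a latent z,
# M predicts the next latent as a mixture density conditioned on the action,
# C is a tiny linear controller reading [z, h]. Sizes are illustrative.

class V(nn.Module):                              # VAE encoder (decoder omitted)
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(), nn.Flatten())
        self.mu, self.logvar = nn.LazyLinear(z_dim), nn.LazyLinear(z_dim)

    def forward(self, obs):
        h = self.conv(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample

class M(nn.Module):                              # MDN-RNN: p(z_{t+1} | z_t, a_t, h_t)
    def __init__(self, z_dim=32, a_dim=3, hidden=256, mixtures=5):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim + a_dim, hidden)
        self.mdn = nn.Linear(hidden, mixtures * (2 * z_dim + 1))  # means, stds, weights

    def forward(self, z, a, state=None):
        h, c = self.rnn(torch.cat([z, a], -1), state)
        return self.mdn(h), (h, c)               # mixture parameters for the next z

class C(nn.Module):                              # controller: a = tanh(W[z, h] + b)
    def __init__(self, z_dim=32, hidden=256, a_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden, a_dim)

    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], -1)))
```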

Danijar Hafner then spent six years building on the same idea with a different architecture. His RSSM architecture (PlaNet, 2019) combined a deterministic memory path with stochastic latents, so the model could carry reliable long-term memory while still representing uncertainty about what it cannot observe. The Dreamer series scaled from simple continuous control (V1, 2020) to human-level Atari (V2, 2021) to a single set of hyperparameters across 150+ benchmarks, including collecting diamonds in Minecraft from scratch (V3, published in Nature 2025). Dreamer 4 (late 2025) replaced the recurrent backbone with transformers, running 25x faster. DayDreamer (2022) put it on real robots: a quadruped learned to walk from scratch in one hour.
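
A rough sketch of one RSSM step may help, again in PyTorch with simplified dimensions and a diagonal-Gaussian latent (an assumption; the real models vary). The point is the split: a GRU carries deterministic memory, while the stochastic latent is sampled from a prior when dreaming and from an observation-conditioned posterior when grounding against real frames.

```python
import torch
import torch.nn as nn

# Minimal RSSM-style step in the spirit of PlaNet/Dreamer: deterministic memory h,
# stochastic latent z. Dimensions and distribution choices are illustrative.

class RSSM(nn.Module):
    def __init__(self, z_dim=30, a_dim=6, h_dim=200, embed_dim=1024):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + a_dim, h_dim)            # deterministic memory
        self.prior = nn.Linear(h_dim, 2 * z_dim)                # p(z_t | h_t)
        self.post = nn.Linear(h_dim + embed_dim, 2 * z_dim)     # q(z_t | h_t, o_t)

    def _sample(self, stats):
        mean, std = stats.chunk(2, -1)
        return mean + torch.randn_like(mean) * nn.functional.softplus(std)

    def imagine(self, h, z, a):
        """One dream step: no observation needed, so a policy can train entirely here."""
        h = self.cell(torch.cat([z, a], -1), h)
        return h, self._sample(self.prior(h))

    def observe(self, h, z, a, obs_embed):
        """One grounded step: the posterior corrects the prior using the real observation."""
        h = self.cell(torch.cat([z, a], -1), h)
        return h, self._sample(self.post(torch.cat([h, obs_embed], -1)))
```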

A notable branch: DeepMind's MuZero (2020) learned a world model that predicted rewards and values but never reconstructed observations. It only modeled what was decision-relevant, mastering Go, chess, and Atari without ever generating a single pixel. A different philosophy from Dreamer, which uses observation reconstruction as a training signal, but the same core idea: imagine possible futures, pick the best action.
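
The contrast with reconstruction-based models is easiest to see in code. Below is a hypothetical sketch of the MuZero-style split into three learned functions, with placeholder sizes: nothing in it ever maps a latent state back to pixels, only to the quantities planning needs.

```python
import torch
import torch.nn as nn

# MuZero-style decomposition: representation, dynamics, prediction. No decoder,
# no pixels; the latent state only has to support reward/value/policy prediction.

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim=128, h_dim=64, n_actions=8):
        super().__init__()
        self.represent = nn.Linear(obs_dim, h_dim)              # observation -> latent state
        self.dynamics = nn.Linear(h_dim + n_actions, h_dim)     # (state, action) -> next state
        self.reward = nn.Linear(h_dim + n_actions, 1)
        self.policy = nn.Linear(h_dim, n_actions)                # guides the search
        self.value = nn.Linear(h_dim, 1)

    def unroll(self, obs, actions):
        """Imagine a future entirely in latent space; actions are one-hot tensors."""
        s = torch.relu(self.represent(obs))
        rewards = []
        for a in actions:
            sa = torch.cat([s, a], -1)
            rewards.append(self.reward(sa))
            s = torch.relu(self.dynamics(sa))
        return rewards, self.policy(s), self.value(s)
```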

What this tradition got right: the core idea. Learn dynamics. Imagine futures. Train policies from imagination rather than expensive real-world interaction. Action conditioning. Sample efficiency. These are the conceptual foundations that every video world model inherits today.

What it couldn't do: generalize across environments. A Dreamer agent could reach human-level performance on a single Atari game, but learning the next game required training from scratch. The models were small (millions of parameters), the dreams were abstract vectors no human could inspect, and they required thousands of task-specific episodes. The idea was right. The scale was wrong.

Thread B: Learning from Watching (2015-2025)

A parallel tradition was learning from video. It developed in stages, each bringing video closer to being useful for robot learning.

Stage 1: Video prediction for planning (2015-2018)

Oh et al. (2015) showed action-conditional video prediction in Atari. Finn et al. (2016) at Berkeley applied it to real robots: train a model to predict what the camera will see after an action, then plan by picking the action whose predicted future looks closest to your goal. It worked for simple pushing tasks, but predictions degraded within a few frames. Too blurry, too short-horizon for complex manipulation.
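
The planning loop itself is simple enough to sketch. Here is a minimal random-shooting version in PyTorch, assuming a learned `predict_video(frames, actions)` function as a stand-in for the action-conditional predictor; the real systems used more refined sampling and cost functions.

```python
import torch

# Visual-foresight-style planning sketch: propose action sequences, imagine each
# future with the learned predictor, execute the first action of the best sequence.

def plan(predict_video, current_frame, goal_frame, horizon=5, n_candidates=100, a_dim=4):
    candidates = torch.randn(n_candidates, horizon, a_dim)       # random-shooting proposals
    frames = current_frame.expand(n_candidates, -1, -1, -1)      # same start frame per candidate
    for t in range(horizon):
        frames = predict_video(frames, candidates[:, t])         # imagined next frames
    cost = ((frames - goal_frame) ** 2).flatten(1).mean(-1)      # pixel distance to the goal image
    return candidates[cost.argmin(), 0]                          # act, then replan next step
```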

Stage 2: Learning representations from human video (2020-2022)

Here the key insight shifted: instead of predicting video directly, use human video to learn visual representations that transfer to robot tasks. R3M (Nair et al., 2022) was the breakthrough: a visual encoder pretrained on Ego4D, thousands of hours of egocentric human footage of cooking, cleaning, and object manipulation. The encoder learned to compress a camera image into a compact vector capturing object identity, spatial relationships, and grasp-relevant features, while ignoring irrelevant detail like wall color and shadows. A Franka arm using R3M features learned manipulation tasks from just 20 demonstrations, a fraction of what was needed without pretraining.
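
The recipe on the robot side is short. A hedged sketch of it: freeze the pretrained encoder (a stand-in for the released R3M weights), and fit a small policy head by behavior cloning on the handful of demonstrations; names, sizes, and the plain MSE loss are illustrative choices.

```python
import torch
import torch.nn as nn

# Few-shot imitation on top of frozen, video-pretrained features (R3M-style recipe).

def behavior_clone(frozen_encoder, demos, feat_dim=2048, a_dim=7, epochs=50):
    head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, a_dim))
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs):
        for image, action in demos:                  # ~20 demos can suffice with good features
            with torch.no_grad():
                feat = frozen_encoder(image)         # compact, transfer-ready representation
            loss = ((head(feat) - action) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```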

Around the same time, OpenAI's VPT (2022) showed internet-scale video pretraining worked for learning to act: a model pretrained on 70,000 hours of Minecraft YouTube gameplay could be fine-tuned into a capable agent with a small number of demonstrations. The first system to show that massive unlabeled video could bootstrap capable behavior in complex sequential tasks.

EgoMimic (Xu et al., ICRA 2025) pushed this further: instead of using human video only for representations, it treated egocentric human footage as actual demonstration data and co-trained a unified policy on both human and robot data. Human embodiment data boosted task performance by 34-228% over robot data alone, and enabled generalization to new objects and scenes.

But even these approaches had a ceiling. Better representations and more demonstration data helped, but the human video was training data for policies, not a simulator you could practice inside.

Stage 3: Video generation at scale (2022-2024)

The quality breakthrough came with diffusion models applied to video: Make-A-Video (Meta, 2022), Imagen Video (Google, 2022), and their successors. Diffusion transformers could generate high-quality, temporally coherent video at scale.

Sora (OpenAI, February 2024) was the inflection point. Trained on enormous amounts of internet video, it generated footage that appeared to obey physics: objects fell, light scattered, cameras tracked convincingly. Google's Veo followed with competitive quality. OpenAI framed Sora as a "world simulator."

But Sora wasn't interactive. It used bidirectional attention: all frames see all other frames simultaneously. You can't inject an action mid-stream. It was a movie, not a game.

What this entire tradition contributed: the proof that human video contains transferable physical knowledge (R3M, VPT), photorealistic generation at scale (Sora, Veo), and the visual diversity that comes from internet-scale data.

What it couldn't do until the convergence: respond to actions in real time. Support the closed loop that robotics requires: act, see consequence, react. Generate video conditioned on specific actions, not just generate video that looks plausible.

The Convergence (2024-2025)

Each community needed what the other had. RL had action conditioning but couldn't generalize. Video had scale and realism but no interactivity. A series of works between 2024 and 2026 bridged the gap:

Genie (DeepMind, 2024-2025) introduced the latent action model: a way to learn interactive environments from unlabeled video. The model looks at two consecutive frames, compresses "what changed" into a small vector, and discovers an action space without anyone labeling the actions. Genie 1 (February 2024) was a proof of concept at 160x90 resolution and 1 FPS. Genie 2 (December 2024) scaled to photorealistic 720p with 10-60 seconds of consistency. Genie 3 (August 2025) reached 24 FPS at 720p with consistency sustained for minutes, though it generates 2D frames (not 3D geometry) and costs roughly $100 per hour to run.
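
A toy sketch of the latent action idea, to make "discovers an action space without labels" concrete. The real Genie models use video tokenizers, vector quantization with straight-through gradients, and large transformers; this stand-in keeps only the core loop of inferring a discrete code that explains the change between frames.

```python
import torch
import torch.nn as nn

# Latent action model sketch: infer a discrete "action" code from (frame_t, frame_t+1),
# then check that frame_t plus the code is enough to reconstruct frame_t+1.
# Straight-through gradients through the quantization step are omitted for brevity.

class LatentActionModel(nn.Module):
    def __init__(self, frame_dim=512, n_actions=8, code_dim=32):
        super().__init__()
        self.encode = nn.Linear(2 * frame_dim, code_dim)      # "what changed" vector
        self.codebook = nn.Embedding(n_actions, code_dim)     # small discrete action vocabulary
        self.decode = nn.Linear(frame_dim + code_dim, frame_dim)

    def forward(self, frame_t, frame_t1):
        change = self.encode(torch.cat([frame_t, frame_t1], -1))
        dists = ((change.unsqueeze(1) - self.codebook.weight.unsqueeze(0)) ** 2).sum(-1)
        action = dists.argmin(-1)                              # nearest code = inferred latent action
        pred = self.decode(torch.cat([frame_t, self.codebook(action)], -1))
        recon_loss = ((pred - frame_t1) ** 2).mean()           # the only training signal
        return action, recon_loss
```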

UniSim (Sherry Yang et al., ICLR 2024 Outstanding Paper) went the other direction: it trained an RL policy entirely inside a video world model, then transferred to a real robot at 81% success. Earlier work (SimPLe, 2020) had trained RL inside learned video models for Atari, but UniSim was the first to do it with a high-quality video diffusion model and demonstrate zero-shot transfer to real-world robotics.

Two technical breakthroughs from Xun Huang's group removed the remaining barriers. AR-DiT / CausVid (CVPR 2025) made video diffusion models autoregressive and causal, the prerequisite for interactivity: instead of generating all frames at once, each frame was generated sequentially, conditioned on past frames and a current action. Self Forcing (NeurIPS 2025) then solved the speed problem, distilling 35 denoising steps down to 4 and enabling real-time interactive generation for the first time in general-purpose video models.
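
Why those two changes matter for interactivity is easiest to see as a loop. Below is a schematic sketch, where `denoiser` stands in for a distilled autoregressive video diffusion model and `get_action` for whatever controller or player is in the loop; the function signatures are assumptions, not any released API.

```python
import torch

# Interactive rollout sketch: frames are generated one at a time (causal), each
# conditioned on history plus the current action (interactive), with only a few
# denoising steps between action and frame (real-time budget).

def interactive_rollout(denoiser, get_action, first_frame, n_frames=100, n_steps=4):
    history = [first_frame]
    for _ in range(n_frames):
        action = get_action(history[-1])             # controller reacts to the latest frame
        frame = torch.randn_like(first_frame)        # start the next frame from noise
        for step in range(n_steps):                  # 4 distilled steps instead of ~35
            frame = denoiser(frame, history, action, step)
        history.append(frame)                        # the past is fixed; time flows forward only
    return history
```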

DreamGen (NVIDIA, May 2025) showed that video world models could unlock robot generalization from minimal real data. The approach: fine-tune a video generation model on a small amount of real robot footage (from the robot's own cameras, including wrist-mounted views), then prompt it with language instructions to generate synthetic videos of the robot performing tasks it's never done. An inverse dynamics model extracts motor commands from these synthetic videos, producing training data without teleoperation. A humanoid performed 22 new behaviors in unseen environments from just one pick-and-place demo. This was the first strong evidence that the convergence could produce practical robotics value.
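
The pipeline reduces to a short loop. A schematic sketch of it, with every function a placeholder for a large model (the fine-tuned video generator, the inverse dynamics model) rather than a real API:

```python
# DreamGen-style data flywheel, schematically: dream videos of new tasks, recover
# the motor commands that would produce them, and feed the result to ordinary
# imitation learning. No teleoperation in the loop.

def generate_synthetic_demos(video_model, inverse_dynamics, instructions, n_rollouts=50):
    dataset = []
    for text in instructions:                        # tasks the robot has never performed
        for _ in range(n_rollouts):
            video = video_model.generate(prompt=text)        # dreamed robot footage
            actions = inverse_dynamics(video)                # motor commands per frame
            dataset.append({"frames": video, "actions": actions, "task": text})
    return dataset
```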

The culmination: DreamDojo and DreamZero (NVIDIA, February 2026). DreamDojo: a video foundation model pretrained on 44,711 hours of human egocentric video, action-conditioned through a learned latent action space, distilled to real-time via Self Forcing, capable of evaluating robot policies with r=0.995 correlation to real-world outcomes. DreamZero went further, jointly predicting future video and robot motor actions in a single forward pass.

The RL people brought action conditioning and the concept of dreaming. The video people brought photorealistic generation and internet-scale data. The result is architecturally descended from video generation but philosophically descended from RL world models.

So What Actually Makes a World Model?

Not all video models are world models. Xun Huang, who co-authored the autoregressive diffusion architecture that several of these systems use, proposed five properties that separate a world model from a video generator:

  1. Causal: time flows forward only. Bidirectional video generation violates this. Hard constraint.

  2. Interactive: responds to actions in real time. Without this, it's a movie, not a simulation. Hard constraint.

  3. Persistent: maintains coherence over long durations. Current models sustain minutes, not hours.

  4. Real-time: fast enough for the application. State of the art: 10-30 FPS.

  5. Physically accurate: respects real-world physics. The hardest property and the most contested.

Causality and interactivity are binary. Without them, you don't have a world model. The other three are spectrums.

How do the major systems stack up against these five properties?


| System | Causal | Interactive | Persistent | Real-time | Physically accurate |
| --- | --- | --- | --- | --- | --- |
| Sora / Veo | No | No | N/A | N/A | Visually plausible |
| Genie 3 | Yes | Yes | Minutes | 24 FPS | Visually plausible |
| DreamDojo | Yes | Yes | Minutes | 10.8 FPS | r=0.995 policy eval |
| DreamZero | Yes | Yes | Minutes | 7 Hz (Blackwell) | 2× vs VLAs |
| Dreamer V3–V4 | Yes | Yes | Unlimited (latent) | Real-time | Task-specific |
| V-JEPA 2 | Yes | Partially | N/A | Sub-second | 80% zero-shot |

3. What Are World Models Actually Good For?

The Use Cases (Most Proven to Most Speculative)

AV simulation is the most mature application. Companies like Wayve (GAIA world model, $1.2B Series D) and Waymo have been using learned world models to generate driving scenarios for testing. The bar is: can you synthesize diverse, realistic driving scenarios that stress-test your self-driving policy? Not full physical accuracy, but enough visual and behavioral realism to find edge cases. This is in production.

Entertainment and gaming is close behind, and arguably has the most tangible demos. Decart's Oasis is a playable Minecraft-like game generated entirely by a world model at 20 FPS, already available to try. Genie 3 generates explorable environments at 24 FPS, 720p. GameNGen runs DOOM on a neural net at 20 FPS. Elon Musk's xAI has announced plans for world-model-based video games by end of 2026 (no demo yet). The physics bar is lower for gaming: players tolerate some unrealism if the experience is engaging. But the serving cost remains brutal: Genie 3 costs roughly $100 per hour to run.

Policy evaluation is where the clearest near-term value lies for robotics. DreamDojo achieves r=0.995 Pearson correlation between its predictions and real-world policy success rates. In practice, that means you can rank 20 candidate policies inside the world model instead of running 20 expensive real-world trials, and the ranking will match reality almost perfectly. This turns the world model into a testing environment — unit tests for robot behavior.
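
In code, the workflow is just a ranking plus a correlation check. A sketch under assumptions: `world_model_success_rate` and `real_success_rate` are stand-ins for dreamed and real rollouts respectively, and in practice the real trials are only run occasionally to validate that the correlation holds.

```python
import numpy as np

# Policy evaluation inside a world model: rank candidates by dreamed success rate,
# and report the Pearson r against (expensive) real-world trials.

def evaluate_policies(policies, world_model_success_rate, real_success_rate):
    dreamed = np.array([world_model_success_rate(p) for p in policies])   # cheap
    real = np.array([real_success_rate(p) for p in policies])             # expensive validation
    r = np.corrcoef(dreamed, real)[0, 1]            # r near 1.0: dreamed ranking matches reality
    ranking = [policies[i] for i in np.argsort(-dreamed)]                 # best first
    return ranking, r
```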

Synthetic training data generation is promising but the marginal value is unclear. DreamGen (NVIDIA, 2025) showed a humanoid performing 22 new behaviors in unseen environments from a single teleoperation demonstration, using video-world-model-generated synthetic data. But even researchers building these systems acknowledge the improvement is modest: there's some gain, but not the dramatic leap the field hoped for. The question is whether synthetic video data provides enough signal above what you'd get from more teleoperation data or better augmentation.

Sample-efficient learning works in controlled settings. DayDreamer (2022) demonstrated a quadruped learning to walk from scratch in one hour of real-world interaction, because the Dreamer world model could imagine thousands of practice runs between each real attempt. But this hasn't been demonstrated at scale in production environments.

Direct robot control is the most ambitious claim and the least proven. DreamZero predicts both future video and motor actions in a single forward pass, reporting 2x better generalization than VLA baselines in its own evaluation. But this is one paper, from the team that built it, with no independent replication. Meanwhile, VLAs keep advancing fast: Pi-0.5 (September 2025) generalized to unseen homes, Pi-0.6 (November 2025) added RL-based self-improvement, and Pi-0.7 (April 2026) composes learned skills to solve novel tasks. A new version every few months, each more capable. VLAs are simpler, cheaper to deploy, and currently more battle-tested.

The Bigger Picture: Robotics AI Is Earlier Than It Looks

The honest framing isn't "VLAs work, world models don't." It's that robotics AI as a whole is earlier than $10 billion in funding suggests. Navigation and constrained warehouse picking work reliably. Cooking demos work in controlled labs with dozens of task-specific demonstrations (ALOHA/Sunday achieves 90% on shrimp sauteing with 50 demos), but each new dish requires new demonstrations. General household manipulation, furniture assembly, and contact-rich dexterous tasks remain unsolved regardless of approach.

The deeper issues cut across both VLAs and world models. Transfer is rarely demonstrated and should never be assumed. Both approaches are vision-only and miss touch, force feedback, and proprioception, which matter enormously for manipulation. The standard training datasets (like Open X-Embodiment) have serious quality and diversity problems. And simulation benchmarks are nearly saturated while real-world zero-shot performance lags far behind.

Meanwhile, the VLA approach isn't standing still. Physical Intelligence's Pi-0.7 (April 2026) demonstrates compositional generalization, combining skills from different tasks to solve novel problems, without any world model. It operates a never-seen air fryer by blending fragments of related training experience. This complicates the narrative that world models are uniquely needed for generalization. Both approaches are making progress through different paths: world models bet on understanding dynamics through video prediction, VLAs bet on scale and compositional training.

The question isn't which approach is winning. It's whether either approach is close enough to general manipulation that scaling will finish the job. The world model community's specific bet: understanding dynamics (through video prediction) will matter for the hardest remaining tasks, where pattern matching from demonstrations isn't enough. That bet looks reasonable, even if the payoff timeline is uncertain.

4. The $10 Billion Bet

Over $10 billion has been deployed into world model and robotics AI companies in the past 18 months. The capital tells you where the industry actually is, not where the papers say it should be.

The money sits in four layers. Pure world model companies building the simulator itself (AMI Labs $1.03B, World Labs $1.23B, Runway $860M+, Rhoda $450M, Decart $153M, Embo $100M+). Robot foundation model companies that use world models as a component (Skild $1.83B, Physical Intelligence $1.1B+, Figure $2B+, Mind Robotics $615M). Platforms that build and open-source the infrastructure (NVIDIA, Google DeepMind). And big tech pivots (OpenAI's post-Sora robotics effort, Tesla, xAI).

The pattern is notable: companies using world models have raised more than companies building them. Either the world model layer is underfunded relative to its importance, or the largest robotics companies will build this capability in-house. 1X Technologies has already done exactly that.

NVIDIA: The 800-Pound Gorilla

The most important strategic development in the space isn't a startup. It's NVIDIA building the full physical AI stack and open-sourcing it.

The stack runs from Cosmos Predict 2.5 (video foundation model, 14B params, 200M video clips) through DreamDojo (action-conditioned world model, 44K hours of human video, r=0.995 policy evaluation) through DreamZero (joint video + action prediction, zero-shot on unseen tasks) through EgoScale (the scaling law: R²=0.9983 between human video hours and robot performance) up to GR00T N2 (productized robot brain, end of 2026). Every layer is open-source Apache 2.0.

The strategy is CUDA for physical AI: give away the software, sell the hardware. DreamZero runs at 7Hz, but only on Blackwell GB200. Not real-time on H100. If every robotics company builds on this stack, they all need Blackwell.

For startups building pure world models, this is existential. DreamDojo is free and trained on 44,000 hours of video. "We built a world model" is no longer a moat. The differentiation has to come from domain-specific data NVIDIA doesn't have, faster inference, or vertical integration into a product that's more than just the model.

The JEPA Contrarian Bet

Not everyone is building video world models. Yann LeCun, who left Meta in late 2025 to found AMI Labs ($1.03B, the largest European seed round ever), argues that predicting pixels is fundamentally wasteful. Most pixel-level detail is irrelevant to understanding dynamics. His alternative, JEPA (Joint Embedding Predictive Architecture), encodes observations into abstract representations and predicts future representations directly, never generating video. Unlike Dreamer, which uses pixel reconstruction as a training signal, JEPA avoids reconstruction entirely.

V-JEPA 2, developed at Meta before LeCun's departure, was pretrained on over a million hours of internet video, then fine-tuned on just 62 hours of robot data. It achieved 80% zero-shot success on pick-and-place tasks without generating a single frame of video. AMI Labs now has a billion dollars to test whether abstract prediction outperforms pixel prediction. The counter-argument: pixel-level prediction might capture physical details that abstract representations miss, and you can watch what a video model thinks will happen. JEPA's predictions are abstract vectors that humans can't inspect.
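
The difference from pixel prediction fits in a few lines. Below is a toy sketch of the JEPA objective, with linear encoders as placeholders for the real vision transformers and with the EMA update of the target encoder omitted: the loss lives entirely in representation space, and nothing ever decodes back to pixels.

```python
import torch
import torch.nn as nn

# JEPA-style objective sketch: predict the *representation* of future frames from the
# representation of context frames. Target encoder is typically an EMA copy (omitted).

class JEPA(nn.Module):
    def __init__(self, frame_dim=1024, repr_dim=256):
        super().__init__()
        self.context_encoder = nn.Linear(frame_dim, repr_dim)
        self.target_encoder = nn.Linear(frame_dim, repr_dim)    # updated by EMA, not gradients
        self.predictor = nn.Linear(repr_dim, repr_dim)

    def loss(self, context_frames, future_frames):
        pred = self.predictor(self.context_encoder(context_frames))
        with torch.no_grad():
            target = self.target_encoder(future_frames)         # no pixel reconstruction anywhere
        return ((pred - target) ** 2).mean()
```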

Where We See Opportunity

NVIDIA's open-source stack raises a real question for everyone building in this space: what's defensible? Our view is that there are several distinct opportunity types, each with different bets and different time horizons.

Frontier horizontal world models. The boldest bet: build a better general-purpose world model than NVIDIA's. AMI Labs is taking this path with JEPA, predicting in abstract representation space. Embo brings a different architectural philosophy, rooted in the Dreamer lineage. Dream Labs (founded by Joel Jang from NVIDIA's GEAR Lab) is building on the DreamGen and DreamDojo line of work. Cosmos and DreamDojo are version 1 of a new category. There's room for architectural leapfrog, the way OpenAI built ChatGPT after years of DeepMind's earlier work.

Vertical-specific world models. NVIDIA's stack is general-purpose. A company building a world model specifically for surgical robotics, warehouse manipulation, or food preparation, with proprietary data from actual deployments, could carve a moat that general models can't match. The analogy: Bloomberg Terminal vs. ChatGPT. Both do language, but Bloomberg's domain data and workflow integration make it irreplaceable for finance professionals. Whether this works depends on how much domain-specific dynamics matter — surgical contact forces and warehouse picking really are different physics regimes that general video models may not capture.

The picks-and-shovels layer. Inference infrastructure, evaluation platforms, sim-to-real transfer tools, data pipelines for egocentric video. These are less glamorous than "we built a world model" but address real pain: Genie 3 costs ~$100/hour to run, Odyssey requires a full H200 per user, video model serving is structurally expensive. Companies that solve these problems capture value across the entire ecosystem. The risk: NVIDIA owns the hardware, and inference optimization is fast-moving research that gets absorbed into open source quickly.

The "world model inside a product" play. Companies where the world model is one component of a vertically integrated robotics product, not the product itself. The end customer pays for outcomes — folded laundry, sorted packages, brewed espresso — not for inference. The model is a means, the robot doing useful work is the product. This is the path most existing robotics companies have taken (1X built their own world model, Figure and Skild integrate Cosmos), but new entrants face hardware + software + go-to-market simultaneously.

The deepest opportunity might be one that doesn't fit neatly into any of these: a startup that combines a frontier model bet with a wedge product or vertical, building the model and the application together so each reinforces the other. Most of the durable AI companies of the last cycle did exactly this.

5. What's Actually at Stake

Three questions stand out.

Can startups survive NVIDIA's commoditization? The open-source stack is comprehensive and free. Startups need an answer beyond "we also built a world model." The viable paths are narrow: proprietary data, specialized domains, or products where the world model is embedded in something bigger.

When does "good enough" physics become actual physics? The EgoScale scaling law suggests brute-force data can sidestep physical understanding for in-distribution tasks. But the gap between interpolation and extrapolation is where robots break things and hurt people. Whether data scaling can close this gap, or whether it requires fundamentally different approaches, remains the deepest open question.

Will the value materialize? World models have clear value in policy evaluation and AV simulation today. The first concrete robotics use cases (DreamDojo for evaluation, DreamGen for synthetic data) are real and working. The EgoScale scaling law suggests the returns improve predictably with more data, and the data is about to get much bigger.

Two research traditions, one from reinforcement learning and one from video generation, spent decades developing in parallel and recently merged into something genuinely new. The convergence produced systems that can imagine physical futures, respond to actions in real time, and transfer knowledge from human video to robots. None of this existed three years ago.

Whether these systems cross the threshold from useful research tool to indispensable robotics infrastructure depends on whether understanding dynamics turns out to matter for the hardest manipulation tasks, the ones where seeing enough examples isn't sufficient and you actually need to predict what happens when you push, pull, and twist. We think it will. But the timeline is less certain than $10 billion suggests, and the honest thing is to say so.

References

Papers

  1. Schmidhuber, J. (1990). "Making the World Differentiable." Technical Report FKI-126-90, TU Munich.

  2. Craik, K. (1943). The Nature of Explanation. Cambridge University Press.

  3. Oh, J. et al. (2015). "Action-Conditional Video Prediction using Deep Networks in Atari Games." NeurIPS 2015.

  4. Finn, C., Goodfellow, I. & Levine, S. (2016). "Unsupervised Learning for Physical Interaction through Video Prediction." NeurIPS 2016.

  5. Ha, D. & Schmidhuber, J. (2018). "World Models." NeurIPS 2018. Interactive demos: worldmodels.github.io

  6. Hafner, D. et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML 2019. (PlaNet)

  7. Schrittwieser, J. et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature. (MuZero)

  8. Hafner, D. et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR 2020. (Dreamer V1)

  9. Hafner, D. et al. (2021). "Mastering Atari with Discrete World Models." ICLR 2021. (DreamerV2)

  10. Nair, S. et al. (2022). "R3M: A Universal Visual Representation for Robot Manipulation." CoRL 2022.

  11. Baker, B. et al. (2022). "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos." NeurIPS 2022.

  12. Wu, P. et al. (2022). "DayDreamer: World Models for Physical Robot Learning." CoRL 2022. Project: danijar.com/project/daydreamer

  13. LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." OpenReview.

  14. Hafner, D. et al. (2025). "Mastering Diverse Domains through World Models." Nature. (DreamerV3)

  15. Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." ICML 2024. (Genie 1)

  16. Yang, S. et al. (2024). "Learning Interactive Real-World Simulators." ICLR 2024 Outstanding Paper. (UniSim). Project: universal-simulator.github.io/unisim

  17. Valevski, D. et al. (2024). "Diffusion Models Are Real-Time Game Engines." (GameNGen). Project: gamengen.github.io

  18. Yin, T., Huang, X. et al. (2025). "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models." CVPR 2025. (AR-DiT / CausVid)

  19. Huang, X. et al. (2025). "Self Forcing." NeurIPS 2025.

  20. Hafner, D. & Yan, W. (2025). "Training Agents Inside of Scalable World Models." (Dreamer 4)

  21. Jang, J. et al. (2025). "DreamGen: Unlocking Generalization in Robot Learning through Video World Models." CoRL 2025.

  22. Gao, S., Liang, W. et al. (2026). "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos." Project: dreamdojo-world.github.io

  23. Ye, S., Ge, Y. et al. (2026). "DreamZero: World Action Models as Zero-shot Policies." Project: dreamzero0.github.io

  24. Zheng, K., Niu, D. et al. (2026). "EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data."

  25. Physical Intelligence. (2025). "Pi-0.5: a Vision-Language-Action Model with Open-World Generalization."
