Everyone Wants to Build a World Model. Almost Nobody Agrees on What That Is.

Table of Contents
The Word That Ate AI
Two Threads Converging
What Makes a World Model?
What Are World Models Actually Good For?
The $10 Billion Bet
What's Actually at Stake
1. The Word That Ate AI
Here's what a world model can do. A robot has never seen a shoelace before. No one has teleoperated it through the untying motion. But the robot reaches down, grasps the lace, and pulls it free. It succeeds because its policy was trained inside a model that watched thousands of hours of human hands doing things, learned how objects behave when you pull and twist and push them, and can imagine what will happen before it acts. The robot practiced in its own imagination before touching reality.
That's the promise. A model that understands the physical world well enough to predict what happens next, and to act on those predictions. Not a language model that describes the world in words. Not a video generator that produces footage. A model of how things actually work.
Over $10 billion has been invested in this idea in the past 18 months. Yann LeCun left Meta to build one. Danijar Hafner, whose Dreamer series is the most influential work in model-based RL, left DeepMind to commercialize one. NVIDIA open-sourced an entire stack of them. OpenAI shut down Sora, framed the shutdown as a pivot to "world simulation for robotics," and then the team lead left the company three weeks later.
Most of what gets called a "world model" isn't one. The term now covers video generators, RL dream-machines, abstract representation learners, and action-predicting foundation models. These aren't competing implementations of the same idea. They're different research programs, descended from two separate traditions, that only recently merged. One of the less obvious but most consequential developments in AI is how that merger happened, and what it produced.
Why now? Two things happened at the same time. First, interactive video models had existed since 2024 (Genie, GameNGen), but only as narrow prototypes. In 2025, two breakthroughs (AR-DiT and Self Forcing) made it possible to take general-purpose, high-quality video foundation models and make them both interactive and real-time. That turned video world models from research curiosities into potentially useful infrastructure. Second, robotics has always been data-starved, but the hunger got orders of magnitude worse as the industry started training foundation models. Today's best robot foundation models train on around 10,000 hours of teleoperation data. But teleoperation is expensive, slow to collect, and narrow in diversity. World models offer a different path: pretrain on the millions of hours of human video that already exist, then fine-tune on small amounts of robot data.
But there's an uncomfortable question underneath the excitement. The most capable robot deployments today use VLAs, not world models. Physical Intelligence's Pi-0.7 (April 2026) demonstrates compositional generalization, mixing and matching learned skills to solve tasks it's never seen. Skild's Brain already generates revenue controlling multiple robot types. World models have shown strong results in specific settings — DreamDojo's near-perfect policy evaluation, DreamGen's generalization from minimal data — but these haven't yet translated into the kind of production-scale advantage that makes them indispensable. Bessemer's analysis identifies unresolved gaps: temporal drift over long horizons, no tactile sensing, and inference costs 100x too high for real-time control. The $10 billion is a bet on a curve that hasn't inflected, not a response to proven demand.
2. Two Threads Converging
What we now call "video world models" is the product of two separate research lineages that developed in parallel for decades, then converged around 2024-2025. Telling this as one story misses what each tradition contributed and why they needed each other.
Thread A: Learning to Dream (RL World Models, 1990-2025)
The idea that an agent should build an internal model of its world is older than deep learning. Kenneth Craik argued in 1943 (The Nature of Explanation) that humans carry "small-scale models" of reality in their heads to anticipate events. In 1990, Jürgen Schmidhuber published "Making the World Differentiable," formalizing this for neural networks: an intelligent agent should learn a differentiable model of its environment and use it to plan. The idea went largely dormant for almost three decades.
In 2018, David Ha and Schmidhuber revived it with a paper titled "World Models" and an interactive website, worldmodels.github.io, that let you watch an AI agent dream. The architecture was three modules: a VAE to compress pixels into latent vectors, an MDN-RNN to predict dynamics as probability distributions in that latent space, and a tiny controller trained entirely on imagined rollouts. An agent trained in its own dreams was deployed to reality, and it worked. CarRacing, VizDoom. Proof of concept.
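To make the three-module structure concrete, here is a minimal PyTorch sketch of the V-M-C pipeline: a VAE that compresses frames, an MDN-RNN that predicts the next latent, and a tiny controller. All module sizes and the toy rollout are illustrative, not the paper's actual configuration.

```python
# Minimal sketch of the 2018 World Models pipeline: V (VAE), M (MDN-RNN), C (controller).
# All sizes are toy values for illustration, not the paper's configuration.
import torch
import torch.nn as nn

class VisionVAE(nn.Module):
    """V: compress a 64x64 frame into a small latent vector z."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)

    def forward(self, frame):
        h = self.enc(frame)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample of z

class MDNRNN(nn.Module):
    """M: predict a mixture-density distribution over the next z, given (z, action) history."""
    def __init__(self, z_dim=32, a_dim=3, hidden=256, mixtures=5):
        super().__init__()
        self.cell = nn.LSTMCell(z_dim + a_dim, hidden)
        self.head = nn.Linear(hidden, mixtures * (2 * z_dim + 1))  # per-mixture means, scales, weight

    def forward(self, z, a, state):
        h, c = self.cell(torch.cat([z, a], dim=-1), state)
        return self.head(h), (h, c)

class Controller(nn.Module):
    """C: tiny linear policy over [z, h]; in the paper it was trained on imagined rollouts with CMA-ES."""
    def __init__(self, z_dim=32, hidden=256, a_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden, a_dim)

    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], dim=-1)))

# One step of "dreaming": encode a frame, pick an action, predict the next latent state.
vae, mdn, ctrl = VisionVAE(), MDNRNN(), Controller()
frame = torch.rand(1, 3, 64, 64)
state = (torch.zeros(1, 256), torch.zeros(1, 256))
z = vae(frame)
action = ctrl(z, state[0])
next_z_dist_params, state = mdn(z, action, state)
```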
Danijar Hafner then spent six years building on the same idea with a different architecture. His RSSM architecture (PlaNet, 2019) combined a deterministic memory path with stochastic latent variables, so the model could both remember what it had seen and represent what it couldn't predict. The Dreamer series scaled from simple continuous control (V1, 2020) to human-level Atari (V2, 2021) to a single set of hyperparameters across 150+ benchmarks, including collecting diamonds in Minecraft from scratch (V3, published in Nature 2025). Dreamer 4 (late 2025) replaced the recurrent backbone with transformers, running 25x faster. DayDreamer (2022) put it on real robots: a quadruped learned to walk from scratch in one hour.
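For comparison, a stripped-down sketch of the RSSM step the Dreamer family builds on: a deterministic recurrent path carries memory, a stochastic latent carries uncertainty, and the same step runs with or without an observation, which is what makes pure imagination possible. Sizes and layers here are placeholders, not the published architecture.

```python
# Minimal sketch of an RSSM-style latent step (PlaNet/Dreamer lineage): a deterministic
# GRU path carries memory, a stochastic latent carries uncertainty. Sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSMStep(nn.Module):
    def __init__(self, stoch=32, deter=200, a_dim=4, embed=64):
        super().__init__()
        self.gru = nn.GRUCell(stoch + a_dim, deter)        # deterministic memory h_t
        self.prior = nn.Linear(deter, 2 * stoch)           # p(z_t | h_t): used when dreaming
        self.post = nn.Linear(deter + embed, 2 * stoch)    # q(z_t | h_t, obs): used when observing

    def forward(self, z_prev, action, h_prev, obs_embed=None):
        h = self.gru(torch.cat([z_prev, action], dim=-1), h_prev)
        stats = self.post(torch.cat([h, obs_embed], -1)) if obs_embed is not None else self.prior(h)
        mean, raw_std = stats.chunk(2, dim=-1)
        z = mean + (F.softplus(raw_std) + 0.1) * torch.randn_like(mean)  # sample the stochastic state
        return z, h

# Imagination: roll the model forward on actions alone, no observations required.
step = RSSMStep()
z, h = torch.zeros(1, 32), torch.zeros(1, 200)
for _ in range(15):                          # a 15-step imagined rollout
    action = torch.rand(1, 4) * 2 - 1
    z, h = step(z, action, h)                # prior branch only: pure dreaming
```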
A notable branch: DeepMind's MuZero (2020) learned a world model that predicted rewards and values but never reconstructed observations. It only modeled what was decision-relevant, mastering Go, chess, and Atari without ever generating a single pixel. A different philosophy from Dreamer, which uses observation reconstruction as a training signal, but the same core idea: imagine possible futures, pick the best action.
In 2022, Yann LeCun published "A Path Towards Autonomous Machine Intelligence," proposing a full cognitive architecture with a world model at its center. His proposed approach, JEPA (Joint Embedding Predictive Architecture), predicted in abstract representation space rather than pixel space, explicitly avoiding the cost of generating observations. This would later become the intellectual foundation for AMI Labs.
What this tradition contributed: the idea that you can learn dynamics, imagine futures, and train policies from imagination. Action conditioning. Sample efficiency. Planning inside a model.
What it couldn't do: generalize. Every new environment needed training from scratch. The dreams were abstract vectors, not visual. And the models were small: millions of parameters, trained on thousands of episodes in a single environment.
Thread B: Learning from Watching (2015-2025)
A parallel tradition was learning from video. It developed in stages, each bringing video closer to being useful for robot learning.
Stage 1: Video prediction for planning (2015-2018)
Oh et al. (2015) showed action-conditional video prediction in Atari. Finn et al. (2016) at Berkeley applied it to real robots: train a model to predict what the camera will see after an action, then plan by picking the action whose predicted future looks closest to your goal. It worked for simple pushing tasks, but predictions degraded within a few frames. Too blurry, too short-horizon for complex manipulation.
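The planning recipe itself is simple, and a toy sketch shows its shape: sample candidate actions, run each through the learned predictor, and keep whichever imagined frame looks most like the goal. The predictor below is a trivial stand-in, not Finn et al.'s model; only the structure of the loop is the point.

```python
# Toy sketch of "visual MPC": sample candidate actions, imagine each outcome with the
# predictor, keep the action whose imagined frame is closest to a goal image.
# predict_frame is a trivial stand-in for a learned action-conditional video model.
import numpy as np

def predict_frame(frame, action):
    # Placeholder dynamics; a real system would run the learned predictor here.
    return np.clip(frame + 0.01 * action.sum(), 0.0, 1.0)

def plan_one_step(frame, goal, n_candidates=64, a_dim=2, seed=0):
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, a_dim))
    costs = [np.mean((predict_frame(frame, a) - goal) ** 2) for a in candidates]
    return candidates[int(np.argmin(costs))]   # action whose imagined future best matches the goal

frame = np.zeros((64, 64, 3))
goal = np.full((64, 64, 3), 0.02)
best_action = plan_one_step(frame, goal)
```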
Stage 2: Learning representations from human video (2020-2022)
A shift in approach. Instead of predicting video directly, use human video to learn visual representations that transfer to robot tasks. R3M (Nair et al., 2022) was the breakthrough: a visual encoder pretrained on Ego4D, thousands of hours of egocentric human footage of cooking, cleaning, and object manipulation. The encoder learned to compress a camera image into a compact vector capturing object identity, spatial relationships, and grasp-relevant features, while ignoring irrelevant detail like wall color and shadows. A Franka arm using R3M features learned manipulation tasks from just 20 demonstrations, a fraction of what was needed without pretraining.
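The recipe R3M enabled is easy to state in code: freeze a video-pretrained encoder and fit a small behavior-cloning head on a handful of demonstrations. The sketch below uses a generic torchvision ResNet and random tensors as stand-ins for the real encoder and the 20 demos; it shows the division of labor, not the actual system.

```python
# Sketch of the R3M-era recipe: frozen video-pretrained encoder ("eyes") plus a small
# behavior-cloning head ("hands") fit on a few demos. The ResNet and random tensors
# below are stand-ins for the real pretrained encoder and the 20 demonstrations.
import torch
import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18(weights=None)             # stand-in for a video-pretrained visual encoder
encoder.fc = nn.Identity()                   # expose the 512-d feature vector
for p in encoder.parameters():
    p.requires_grad = False                  # the encoder stays frozen

policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))  # 7-DoF action head
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

demo_frames = torch.rand(20, 3, 224, 224)    # placeholder for 20 demonstration frames
demo_actions = torch.rand(20, 7)             # placeholder for the corresponding actions
for _ in range(10):                          # behavior cloning on the tiny demo set
    with torch.no_grad():
        feats = encoder(demo_frames)         # features only; no gradient into the encoder
    loss = nn.functional.mse_loss(policy(feats), demo_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```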
Around the same time, OpenAI's VPT (2022) showed internet-scale video pretraining worked for learning to act: a model pretrained on 70,000 hours of Minecraft YouTube gameplay could be fine-tuned into a capable agent with a small number of demonstrations. First proof at scale that watching humans do things teaches an AI to do those things.
But these approaches had a ceiling. R3M gave the robot better "eyes" (representations) but it still needed robot demonstrations to learn the "hands" (policy). The human video was a feature extractor, not a simulator you could practice inside.
Stage 3: Video generation at scale (2022-2024)
The quality breakthrough came with diffusion models applied to video: Make-A-Video (Meta, 2022), Imagen Video (Google, 2022), and their successors. Diffusion transformers could generate high-quality, temporally coherent video at scale.
Sora (OpenAI, February 2024) was the inflection point. Trained on vast amounts of internet video, it generated footage that appeared to obey physics: objects fell, light scattered, cameras tracked convincingly. Google's Veo followed with competitive quality. OpenAI framed Sora as a "world simulator."
But Sora wasn't interactive. It used bidirectional attention: all frames see all other frames simultaneously. You can't inject an action mid-stream. It was a movie, not a game.
What this entire tradition contributed: the proof that human video contains transferable physical knowledge (R3M, VPT), photorealistic generation at scale (Sora, Veo), and the visual diversity that comes from internet-scale data.
What it couldn't do until the convergence: respond to actions in real time. Support the closed loop that robotics requires: act, see consequence, react. Generate video conditioned on specific actions, not just generate video that looks plausible.
The Convergence (2024-2025)
Each community realized it needed what the other had. RL world models had the right idea (imagine futures, condition on actions, train policies) but couldn't generalize beyond single environments. Video generation had the right data and scale (internet-scale pretraining, photorealistic output) but no interactivity.
Several works bridged the gap:
Genie 1 (DeepMind, February 2024) introduced the latent action model: a way to learn interactive environments from unlabeled video, without requiring action labels. It brought RL's concept of action conditioning into the video generation world. Trained on 200,000 hours of platformer videos, it could turn a single image into a playable environment, though only at 160x90 resolution and 1 FPS. Genie 2 (December 2024) scaled this to photorealistic environments at 720p with 10-60 seconds of consistency, using an autoregressive latent diffusion backbone.
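The latent action idea is worth seeing in miniature: infer a small discrete "action" code from each pair of consecutive unlabeled frames, then train the dynamics model to predict the next frame given that code. The sketch below is a toy version with illustrative sizes (it omits the straight-through and commitment tricks a real VQ-style model needs); it is not Genie's architecture.

```python
# Toy sketch of a latent action model: infer a discrete "action" from two consecutive
# unlabeled frames, then predict the next frame from (frame, latent action). Sizes are
# illustrative, and the straight-through/commitment tricks of a real VQ model are omitted.
import torch
import torch.nn as nn

NUM_ACTIONS, DIM = 8, 128
action_encoder = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * 32 * 32, DIM))  # looks at (x_t, x_t+1)
codebook = nn.Embedding(NUM_ACTIONS, DIM)                                      # discrete action codes
dynamics = nn.Sequential(nn.Linear(3 * 32 * 32 + DIM, 512), nn.ReLU(),
                         nn.Linear(512, 3 * 32 * 32))                          # next-frame predictor

x_t, x_t1 = torch.rand(16, 3, 32, 32), torch.rand(16, 3, 32, 32)               # unlabeled video pairs
query = action_encoder(torch.cat([x_t, x_t1], dim=1))                          # continuous action guess
codes = torch.cdist(query, codebook.weight).argmin(dim=-1)                     # snap to nearest code
latent_action = codebook(codes)                                                # the inferred "action"
pred = dynamics(torch.cat([x_t.flatten(1), latent_action], dim=-1))
loss = nn.functional.mse_loss(pred, x_t1.flatten(1))                           # train to predict the next frame
```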
UniSim (Sherry Yang et al., ICLR 2024 Outstanding Paper) went the other direction: it trained an RL policy entirely inside a video world model, then transferred to a real robot at 81% success. First proof that video generation could serve as an RL training environment.
GameNGen (Google Research, 2024) offered a vivid demonstration from a different angle: a fine-tuned Stable Diffusion model running the original DOOM interactively at 20+ FPS on a single TPU. No game engine. The neural network was the engine. Narrow, but it made the concept viscerally concrete: a learned model replacing a hand-coded simulator entirely.
AR-DiT / CausVid (Tianwei Yin, Xun Huang et al., CVPR 2025) made video diffusion models autoregressive and causal. This was the technical prerequisite for interactivity. Instead of generating all frames at once (bidirectional), each frame was generated sequentially, conditioned on past frames and a current action. Movies became games.
Self Forcing (Xun Huang et al., NeurIPS 2025) solved the speed problem. Autoregressive diffusion models were slow (35 denoising steps per frame). Self Forcing distilled this to 4 steps, enabling real-time interactive generation. Without this, video world models were too slow to be useful.
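Put together, AR-DiT-style causality and Self Forcing-style few-step denoising yield the loop that makes a video model interactive: at each tick, take the current action, denoise the next frame in a handful of steps conditioned on history and that action, and repeat. The sketch below is purely structural; the "denoiser" is a placeholder, not a real diffusion model.

```python
# Structural sketch of an interactive, causal rollout: one frame at a time, conditioned on
# history plus the current action, with only a few denoising steps per frame. The "denoiser"
# below is a placeholder, not a real video diffusion model.
import numpy as np

def denoise_next_frame(history, action, steps=4, rng=np.random.default_rng(0)):
    frame = rng.normal(size=(64, 64, 3))                  # start from noise
    context = history[-1] if history else np.zeros((64, 64, 3))
    for _ in range(steps):                                # few steps, in the Self Forcing spirit
        frame = 0.5 * frame + 0.5 * (context + 0.1 * action.mean())   # placeholder "denoising" update
    return frame

history = []
for t in range(30):                                       # the closed loop: act, see, act again
    action = np.random.uniform(-1.0, 1.0, size=3)         # e.g. controller input arriving at time t
    history.append(denoise_next_frame(history, action))   # each frame depends only on the past
```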
DreamGen (NVIDIA, May 2025) showed that video world models could unlock robot generalization from minimal real data. A humanoid performed 22 new behaviors in unseen environments from just one pick-and-place teleoperation demo. The video model generated synthetic training data; the robot learned from it. This was the first strong evidence that the convergence could produce practical robotics value.
The culmination: DreamDojo and DreamZero (NVIDIA, February 2026). DreamDojo: a video foundation model pretrained on 44,711 hours of human egocentric video, action-conditioned through a learned latent action space, distilled to real-time via Self Forcing, capable of evaluating robot policies with r=0.995 correlation to real-world outcomes. DreamZero went further, jointly predicting future video and robot motor actions in a single forward pass.
The RL people brought action conditioning and the concept of dreaming. The video people brought photorealistic generation and internet-scale data. The result is architecturally descended from video generation but philosophically descended from RL world models.
3. What Makes a World Model?
Not all video models are world models. This distinction matters more than almost anything else in the space, and most coverage misses it.
Xun Huang, who co-authored the autoregressive diffusion architecture that several video world models now use, proposed five properties that separate a world model from a video generator. Ordered from most fundamental to most challenging:
Causal: time flows forward only. The past determines the future, not vice versa. Bidirectional video generation violates this. Hard constraint.
Interactive: the model responds to actions injected in real time. Without this, it's a movie, not a simulation. Hard constraint.
Persistent: maintains coherence over long durations. Objects don't vanish. Rooms don't change layout. Current models sustain consistency for minutes, not hours.
Real-time: generates frames fast enough for the application. Live streaming needs ~1 second latency. Gaming needs 100ms. VR needs 10ms. State of the art: 10-30 FPS depending on the system.
Physically accurate: respects real-world physics. Objects fall, collide, deform correctly. The hardest property and the most contested.
Causality and interactivity are binary. Without them, you don't have a world model. The other three are spectrums.
How Existing Systems Stack Up
The pixel-vs-latent distinction (whether the model generates video frames or operates in abstract vector space) is one technical axis, but it's not what determines whether something is a world model. Sora generates pixels but isn't a world model because it fails the first two properties. Dreamer operates in latent space and is one.
The JEPA Counterargument
There's a third path that rejects both pixel-space video generation and the Dreamer-style approach. Yann LeCun, who left Meta in late 2025 to found AMI Labs ($1.03B, the largest European seed round ever), argues that predicting pixels is fundamentally wasteful. Most pixel-level detail — exact textures, lighting angles, background clutter — is irrelevant to understanding dynamics. A model that spends capacity reproducing these details is doing unnecessary work.
His alternative, JEPA (Joint Embedding Predictive Architecture), works differently from both Dreamer and video world models. It encodes observations into abstract representations, then predicts future representations directly — never decoding back to pixels. Unlike Dreamer, which uses pixel reconstruction as a training signal (the world model learns partly by trying to reconstruct what it sees), JEPA avoids reconstruction entirely. It stays in abstract embedding space by design.
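The objective is easy to sketch: encode the current and future observations, predict the future embedding from the current one, and compute the loss in embedding space, with no decoder anywhere. The encoders and sizes below are placeholders, not V-JEPA's architecture.

```python
# Sketch of a JEPA-style objective: predict the future *embedding* from the current one,
# with the loss in representation space and no pixel decoder anywhere. Encoders and sizes
# are placeholders, not V-JEPA's architecture.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))         # context encoder
target_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # typically an EMA copy
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

frame_now, frame_future = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)
with torch.no_grad():
    target = target_encoder(frame_future)        # target representation, no gradient
pred = predictor(encoder(frame_now))             # predicted future representation
loss = nn.functional.mse_loss(pred, target)      # loss lives entirely in embedding space
loss.backward()                                  # nothing here reconstructs a single pixel
```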
V-JEPA 2, developed at Meta before LeCun's departure, was pretrained on over a million hours of internet video, then fine-tuned on just 62 hours of robot data. It achieved 80% zero-shot success on pick-and-place tasks without generating a single frame of video.
LeCun's bet: if you discard pixel-level detail from the start, your representations will be more robust, more transferable, and more compute-efficient. AMI Labs now has a billion dollars to test this thesis. The counter-argument from the video world model camp: pixel-level prediction might capture physical details (contact dynamics, material properties, deformation) that abstract representations miss. And pixel-space predictions are directly interpretable — you can watch what the model thinks will happen. JEPA's predictions are abstract vectors that humans can't inspect.
This is a genuine architectural disagreement, not a marketing distinction. It will likely take years to resolve.
The questions that matter most right now: Is the system action-conditioned? What data was it trained on? And critically: what is it actually useful for?
4. What Are World Models Actually Good For?
This is the question the field hasn't fully answered, and the honest assessment is more nuanced than either the hype or the skepticism suggests.
The Use Cases (Most Proven to Most Speculative)
AV simulation is the most mature application. Companies like Wayve (GAIA world model, $1.2B Series D) and Waymo have been using learned world models to generate driving scenarios for testing. The bar is: can you synthesize diverse, realistic driving scenarios that stress-test your self-driving policy? Not full physical accuracy, but enough visual and behavioral realism to find edge cases. This is in production.
Entertainment and gaming is close behind, and arguably has the most tangible demos. Decart's Oasis is a playable Minecraft-like game generated entirely by a world model at 20 FPS, already available to try. Genie 3 generates explorable environments at 24 FPS, 720p. GameNGen runs DOOM on a neural net at 20 FPS. Elon Musk's xAI has announced plans for world-model-based video games by end of 2026 (no demo yet). The physics bar is lower for gaming: players tolerate some unrealism if the experience is engaging. But the serving cost remains brutal: Genie 3 costs roughly $100 per hour to run.
Policy evaluation is where the clearest near-term value lies for robotics. DreamDojo achieves r=0.995 Pearson correlation between its predictions and real-world policy success rates. In practice: if you have 20 candidate robot policies, you can rank them inside DreamDojo instead of running 20 expensive real-world trials. The ranking will be almost identical to reality. This turns the world model into a testing environment, like unit tests for robot behavior.
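In code, the workflow is almost embarrassingly simple, which is part of its appeal: score each candidate policy inside the world model, rank, and verify the ranking against whatever real trials you can afford. The numbers below are made up for illustration; DreamDojo's reported r=0.995 is the claim that this ranking tracks reality.

```python
# Sketch of world-model-based policy evaluation: score candidates in the simulator, rank,
# and check the ranking against real trials with a Pearson correlation. Numbers are invented.
import numpy as np

world_model_scores = np.array([0.91, 0.42, 0.77, 0.63, 0.85])   # predicted success rates
real_world_scores  = np.array([0.88, 0.40, 0.74, 0.60, 0.83])   # measured on the robot

r = np.corrcoef(world_model_scores, real_world_scores)[0, 1]    # Pearson r between the two
ranking = np.argsort(-world_model_scores)                       # best-first ordering to test
print(f"Pearson r = {r:.3f}; evaluate policy #{ranking[0]} on hardware first")
```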
Synthetic training data generation is promising but the marginal value is unclear. DreamGen (NVIDIA, 2025) showed a humanoid performing 22 new behaviors in unseen environments from a single teleoperation demonstration, using video-world-model-generated synthetic data. But even researchers building these systems acknowledge the improvement is modest: there's some gain, but not the dramatic leap the field hoped for. The question is whether synthetic video data provides enough signal above what you'd get from more teleoperation data or better augmentation.
Sample-efficient learning works in controlled settings. DayDreamer (2022) demonstrated a quadruped learning to walk from scratch in one hour of real-world interaction, because the Dreamer world model could imagine thousands of practice runs between each real attempt. But this hasn't been demonstrated at scale in production environments.
Direct robot control is the most ambitious claim and the least proven. DreamZero predicts both future video and motor actions in a single forward pass, reporting 2x better generalization than VLA baselines in its own evaluation. But this is one paper, from the team that built it, with no independent replication. Meanwhile, VLAs keep advancing fast: Pi-0.5 (September 2025) generalized to unseen homes, Pi-0.6 (November 2025) added RL-based self-improvement, and Pi-0.7 (April 2026) composes learned skills to solve novel tasks. A new version every few months, each more capable. VLAs are simpler, cheaper to deploy, and currently more battle-tested.
The Honest Tension
The vision is compelling: a model that understands how the physical world works should be better at acting in it than a model that just maps observations to actions. A chess player with a mental model of the board should beat one running on pattern recognition alone.
But we don't have strong evidence yet that video world models have crossed the threshold from "interesting research" to "necessary infrastructure." VLAs work well enough for many current deployments. Teleoperation data, while expensive, produces reliable policies. The world model advantage may be real but incremental, not transformative. At least for now.
The bull case: we're at the GPT-2 stage. Policy evaluation and synthetic data are the wedge. As models scale and physical accuracy improves, direct control becomes viable, and the advantage compounds. The $10B bet is that this curve will steepen.
The most concrete value today is in testing and evaluation, not in direct robot control. Whether world models eventually become the foundation of all robotics AI, or remain a useful but non-essential tool, is still an open question.
5. The $10 Billion Bet
Over $10 billion has been deployed into world model and robotics AI companies in the past 18 months. The capital tells you where the industry actually is, not where the papers say it should be.
The money sits in four layers. At the bottom: pure world model companies building the simulator itself (AMI Labs, World Labs, Runway, Odyssey, Rhoda, Decart, Embo — roughly $4 billion combined). Above them: robot foundation model companies that use world models as a component (Skild, Physical Intelligence, Figure, Mind Robotics — roughly $6 billion). Then the platforms: NVIDIA and Google DeepMind, who build and open-source the infrastructure. And at the edges: big tech pivots, like OpenAI's Sora team redirecting to robotics (before the team lead left entirely).
The pattern is notable: companies using world models have raised more than companies building them. Either the world model layer is underfunded relative to its importance, or the largest robotics companies will build this capability in-house. 1X Technologies has already done exactly that.
NVIDIA: The 800-Pound Gorilla
The most important strategic development in the space isn't a startup. It's NVIDIA building every layer of the physical AI stack and open-sourcing it.
The stack runs from Cosmos Predict 2.5 (video foundation model, 14B params, 200M video clips) through DreamDojo (action-conditioned world model, 44K hours of human video, r=0.995 policy evaluation) through DreamZero (joint video + action prediction, zero-shot on unseen tasks) through EgoScale (the scaling law: R²=0.9983 between human video hours and robot performance) up to GR00T N2 (productized robot brain, end of 2026). Every layer is open-source Apache 2.0.
The strategy is CUDA for physical AI: give away the software, sell the hardware. DreamZero runs at 7Hz, but only on Blackwell GB200. Not real-time on H100. If every robotics company builds on this stack, they all need Blackwell.
For startups building pure world models, this is existential. DreamDojo is free and trained on 44,000 hours of video. "We built a world model" is no longer a moat. The differentiation has to come from domain-specific data NVIDIA doesn't have, faster inference, or vertical integration into a product that's more than just the model.
Where We See Opportunity
If the world model layer is being commoditized by NVIDIA, where is the defensible value?
Vertical-specific world models. NVIDIA's stack is general-purpose. A company that builds a world model specifically for surgical robotics, or warehouse manipulation, or food preparation, with proprietary data from actual deployments, could build a moat that general-purpose models can't match. The analogy: Bloomberg Terminal vs. ChatGPT. Both do language, but Bloomberg's data moat makes it irreplaceable.
The picks-and-shovels layer. Companies that make world models more useful without being the model itself: evaluation platforms, sim-to-real transfer tools, data pipelines for egocentric video. Inference optimization for real-time deployment. These are less glamorous than "we built a world model" but potentially more defensible.
The "world model inside a product" play. Companies where the world model is embedded in a vertically integrated robotics product, not sold as infrastructure. The model is a component of a robot that does a specific job, not a standalone API. This is harder to build but harder to commoditize.
6. What's Actually at Stake
The field is moving fast enough that some of these questions may be answered by the time you read this. But three stand out.
Can startups survive NVIDIA's commoditization? The open-source stack is comprehensive and free. Startups need an answer beyond "we also built a world model." The viable paths are narrow: proprietary data, specialized domains, or products where the world model is embedded in something bigger.
When does "good enough" physics become actual physics? The EgoScale scaling law suggests brute-force data can sidestep physical understanding for in-distribution tasks. But the gap between interpolation and extrapolation is where robots break things and hurt people. Whether data scaling can close this gap, or whether it requires fundamentally different approaches, remains the deepest open question.
Will the value materialize? World models have clear value in policy evaluation and AV simulation today. The bet is that this extends to direct robot control as models scale. But VLAs are currently simpler and more practical. The world model advantage may be real but incremental, or it may be transformative once a threshold is crossed. The honest answer: we don't know yet.
Two research traditions, one from reinforcement learning and one from video generation, spent decades developing in parallel and recently merged into something new. Whether that new thing lives up to the name "world model," whether it genuinely understands the physical world or just generates plausible video of it, is the question that $10 billion is riding on.
References
Papers
Craik, K. (1943). The Nature of Explanation. Cambridge University Press.
Schmidhuber, J. (1990). "Making the World Differentiable." Technical Report FKI-126-90, TU Munich.
Oh, J. et al. (2015). "Action-Conditional Video Prediction using Deep Networks in Atari Games." NeurIPS 2015.
Finn, C., Goodfellow, I. & Levine, S. (2016). "Unsupervised Learning for Physical Interaction through Video Prediction." NeurIPS 2016.
Ha, D. & Schmidhuber, J. (2018). "World Models." NeurIPS 2018. Interactive demos: worldmodels.github.io
Hafner, D. et al. (2019). "Learning Latent Dynamics for Planning from Pixels." ICML 2019. (PlaNet)
Schrittwieser, J. et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature. (MuZero)
Hafner, D. et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination." ICLR 2020. (DreamerV1)
Hafner, D. et al. (2021). "Mastering Atari with Discrete World Models." ICLR 2021. (DreamerV2)
Nair, S. et al. (2022). "R3M: A Universal Visual Representation for Robot Manipulation." CoRL 2022.
Baker, B. et al. (2022). "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos." NeurIPS 2022.
Wu, P. et al. (2022). "DayDreamer: World Models for Physical Robot Learning." CoRL 2022. Project: danijar.com/project/daydreamer
LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." OpenReview.
Hafner, D. et al. (2025). "Mastering Diverse Domains through World Models." Nature. (DreamerV3)
Bruce, J. et al. (2024). "Genie: Generative Interactive Environments." ICML 2024. (Genie 1)
Yang, S. et al. (2024). "Learning Interactive Real-World Simulators." ICLR 2024 Outstanding Paper. (UniSim). Project: universal-simulator.github.io/unisim
Valevski, D. et al. (2024). "Diffusion Models Are Real-Time Game Engines." (GameNGen). Project: gamengen.github.io
Yin, T., Huang, X. et al. (2025). "From Slow Bidirectional to Fast Autoregressive Video Diffusion Models." CVPR 2025. (AR-DiT / CausVid)
Huang, X. et al. (2025). "Self Forcing." NeurIPS 2025.
Hafner, D. & Yan, W. (2025). "Training Agents Inside of Scalable World Models." (Dreamer 4)
Jang, J. et al. (2025). "DreamGen: Unlocking Generalization in Robot Learning through Video World Models." CoRL 2025.
Gao, S., Liang, W. et al. (2026). "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos." Project: dreamdojo-world.github.io
Ye, S., Ge, Y. et al. (2026). "DreamZero: World Action Models as Zero-shot Policies." Project: dreamzero0.github.io
Zheng, K., Niu, D. et al. (2026). "EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data."
Physical Intelligence. (2025). "Pi-0.5: A Vision-Language-Action Model with Open-World Generalization."
Blog Posts & Analysis
Xun Huang, "Towards Video World Models"
Bessemer Venture Partners, "Can World Models Unlock General Purpose Robotics?" (March 2026)
Google DeepMind, "Genie 3: A New Frontier for World Models"
OpenAI, "Video Generation Models as World Simulators" (February 2024)
Company & Industry News
"OpenAI is shutting down its Sora video app" — CNN, March 2026
"Kevin Weil and Bill Peebles exit OpenAI" — TechCrunch, April 2026
"Yann LeCun's AMI Labs raises $1.03 billion" — TechCrunch, March 2026
"Rhoda AI exits stealth with $450 million" — BusinessWire, March 2026
"Ex-Google DeepMind researchers raising $100 million to build world models" (Embo) — The Information, March 2026
"Decart's AI simulates a real-time, playable version of Minecraft" — TechCrunch, October 2024
"xAI to launch AI-powered video game by 2026" — India TV News, October 2025
Projects & Demos
worldmodels.github.io — Ha & Schmidhuber interactive demos
DreamerV3 project page — Hafner
DreamDojo GitHub — NVIDIA, open-source Apache 2.0
DreamZero evaluation gallery — 115 zero-shot task demos
Oasis 2.0 — Decart, playable AI-generated game