The Next Half-Second

2026-05-19

Tight deadline, and the presentation won’t leave your head. You get up to grab a coffee, pour it, hip a chair out of the way and walk back. Every drop stays inside because your brain simulates the next half-second of physics, compares it to what your sensors report and corrects faster than you consciously notice.

When the mug is heavier than you expected or the floor slicker than it looked, you experience surprise; the model updates, and the brain simulates the next half-second more accurately. The leading account of cognition calls this predictive processing: the brain generates expected percepts, compares them against incoming sensory data and updates both its predictions and its motor outputs in real time. This loop runs underneath every action you have ever taken, from braking into a curve your tires may or may not grip to feeling the texture of a doorknob before your fingers close. Intelligence, in this account, is the slow shaping of the simulator by accumulated surprise across millions of action-update cycles; what you call experience is the rendering of a simulator that has been running, continuously, since you were born.1

1. Simulation

Now watch a video of someone else pouring water into a glass. You know, before the water hits the rim, whether it will overflow; you can feel the weight of the pitcher in a hand you are not holding. Now imagine you are the one pouring. The loop changes: you simulate forward to decide what should happen, and the distance between what you predicted and what you wanted is how you correct your grip, your angle and your speed.

A child learns physics from roughly 20,000 waking hours of unstructured sensory experience before she starts school: objects falling, liquids pouring, hands grasping and light scattering across surfaces she has never seen before. The physics are never explained; they are felt, over and over, until the model stops being surprised. She runs the loop, registers surprise, updates the model and runs it again; by the time she picks up a toy she has never seen before, those hours of reaching, grasping and dropping have built a model of how objects behave when you grip and lift. The breadth is the curriculum. Simulation converts raw sensory exposure into the ability to act in situations you have never encountered, and the feedback loop makes the conversion reliable: without surprise there is no update, and without an update the simulator never improves.

LLMs succeeded at coding and mathematics because those are domains of pure symbolic reasoning where the loop closes naturally: write the code, run it, observe the error and revise.2 The proof either checks or it doesn’t. Cause and effect stay close enough to touch. For the physical world, the equivalent loop requires a simulator that can render what will happen next, compare the rendering against what actually happens and use the distance between the two as a training signal dense enough to reshape the model. Every downstream task that depends on physical reasoning, a robot gripping an egg, a prosthetic hand anticipating the weight of a ceramic mug, an autonomous vehicle simulating black ice two seconds before the rear tires lose purchase, requires this loop to close.

2. World models

“World model” may have surpassed “AGI” as the term everyone uses and nobody agrees on. Narrow RL agents learned world models inside Atari games and simulated Go boards in the 2010s; LeCun proposed a latent world model as the path to autonomous intelligence; roboticists use the term to mean a dynamics model of a specific arm in a specific workspace; and the marketing departments of half a dozen startups use it to mean whatever they shipped last quarter.3

Strip away the branding and a world model needs three things: simulate what happens next in observable space, act by conditioning the simulation on actions so it answers “what happens if I do X,” and update by feeding surprise back into the simulator so each rendering improves on the last.
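As a rough illustration of those three requirements, here is a toy sketch in Python. It is not any production system: the class name ToyWorldModel, the one-dimensional "push an object" world, the hidden true_gain and the learning rate are all assumptions made for the example.

```python
class ToyWorldModel:
    """A one-parameter stand-in for a world model: next = state + gain * action."""

    def __init__(self, gain=1.0, learning_rate=0.5):
        self.gain = gain                    # the model's belief about the world
        self.learning_rate = learning_rate  # how strongly surprise reshapes it

    def simulate(self, state, action):
        # Simulate: predict what happens next, conditioned on the action.
        return state + self.gain * action

    def act(self, state, target):
        # Act: invert the model to answer "what do I do to get X".
        return (target - state) / self.gain

    def update(self, state, action, observed_next):
        # Update: feed surprise (prediction error) back into the model.
        surprise = observed_next - self.simulate(state, action)
        if action != 0:
            self.gain += self.learning_rate * surprise / action
        return surprise


true_gain = 2.0  # hidden property of the toy world, unknown to the model
model = ToyWorldModel()
surprises = []
for _ in range(20):
    state, target = 0.0, 1.0
    action = model.act(state, target)
    observed_next = state + true_gain * action  # what the world actually does
    surprises.append(abs(model.update(state, action, observed_next)))
```

In this toy the model starts out believing a push moves the object half as far as it really does, so its first action overshoots. Each surprise halves the error in its belief, and the predicted and observed outcomes converge.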

Code and mathematics gave LLMs an easy loop to close. The physical world does not pause while the planner deliberates; a dropped glass does not buffer. The loop here must run in rendered space, fast enough to act on, dense enough to feel the difference between silk and sandpaper.

3. Video as the foundation

Video generation models are the most promising foundation because they train on the world itself: endless hours of humans cooking, building, driving, failing and recovering, in every environment, lighting condition and physical regime.4 Rendering in observable space – pixels and forces and textures, the weight of a pitcher before you tilt it, the slide of a wheel across wet asphalt – provides something latent objectives discard: dense, generalizable error signals grounded in what sensors actually measure. You can compress these signals into latent vectors for planning and transfer, but the compression is always derived from sensory-level training and validated against sensory-level outputs.

Early results bear this out: a humanoid trained on video world models performed 22 new behaviors in unseen environments from a single pick-and-place demonstration;5 Runway’s GWM-Robotics, built on a video generation foundation, correlated at 0.95 with real-world policy outcomes across eight manipulation policies and 16,000 human evaluations, outperforming traditional real-to-sim approaches that require 3D scene reconstruction.6 The visual diversity of the training data is what lets these models generalize; the same model that learned how liquids pour also learned how fabric drapes, how light shifts across a room at different times of day, how a stack of plates responds when you pull one from the middle. One set of rules, captured all at once.
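For readers unfamiliar with the statistic, a correlation like the one quoted above compares, policy by policy, how well each policy scores inside the simulator with how well it scores on real hardware. The sketch below computes a Pearson correlation from scratch. The eight pairs of success rates are made-up placeholders, not the published measurements, and the function and variable names are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates for eight policies (placeholder numbers only).
simulated = [0.10, 0.25, 0.30, 0.45, 0.50, 0.65, 0.70, 0.90]
real = [0.15, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.85]

r = pearson(simulated, real)
```

A value near 1 means the simulator ranks and spaces the policies much as reality does, which is what makes it usable as a cheaper stand-in for physical evaluation.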

Two prominent alternatives skip rendering in opposite directions. Latent prediction avoids pixel-level generation entirely, predicting the next abstract vector in a learned embedding space. The trade is real: it gains efficiency, transfers well across robot embodiments and can be grounded downstream through action-conditioned fine-tuning. The open question is whether latent pretraining alone captures enough about the physical world, or whether the information discarded during compression turns out to matter for generalization. Direct action models skip rendering from the other direction: they map observations and language instructions straight to motor commands, pattern-matching from demonstrations without imagining what will happen next. They are starved for data and walled inside their training distribution; each new dish requires new demonstrations. Even the leading direct-action systems have begun integrating lightweight world models for subgoal planning, a convergence that suggests pattern-matching alone is not enough.

Both approaches may prove essential as components. My bet is that the main simulation engine, the one that generalizes across physics and scales, will render in observable space – not only pixels but eventually force, texture, sound – whatever modality the task demands. Video is the highest-leverage starting point because the data already exists at scale, and it improves fastest precisely because it renders; a professional can watch the output, reject what is wrong and close the feedback loop in a way that latent representations do not easily afford.

4. Reasoning, planning and the feedback loop

A simulator alone produces futures; it does not know which futures matter. A reasoning layer evaluates rendered rollouts and decomposes goals, drawing on everything raw perception cannot touch: strategy, causation, counterfactual reasoning and social dynamics. It decides what to simulate next. Planning chains multiple renderings together to evaluate long sequences of actions. Motor control consumes the result and translates it into actuator commands. An undirected simulator is a projector running in an empty room. Reasoning without simulation is pure abstraction, disconnected from the physics it needs to act on.
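A minimal sketch of that division of labor, under heavy simplifying assumptions: the "simulator" is a hand-written point mass, not a learned video model; the "reasoning layer" is reduced to a scoring function; and planning is a brute-force search over short action sequences. All names and numbers are hypothetical.

```python
from itertools import product


def simulate(state, action, dt=1.0):
    """Toy simulator: a point mass whose action is an acceleration."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def rollout(state, actions):
    """Chain several simulated steps together into one imagined future."""
    for action in actions:
        state = simulate(state, action)
    return state


def score(state, goal):
    """Stand-in for the reasoning layer: prefer ending at the goal, at rest."""
    position, velocity = state
    return -(abs(goal - position) + abs(velocity))


def plan(state, goal, horizon=4, choices=(-1.0, 0.0, 1.0)):
    """Evaluate every candidate action sequence and keep the best-scoring one."""
    return max(
        product(choices, repeat=horizon),
        key=lambda actions: score(rollout(state, actions), goal),
    )


start = (0.0, 0.0)  # position, velocity
goal = 4.0
best_actions = plan(start, goal)
final_state = rollout(start, best_actions)
```

Here the planner imagines 81 futures and picks the one that accelerates twice and brakes twice, arriving at the goal at rest. The simulator supplies the futures and the scoring function decides which one matters. A real system would replace exhaustive search with something far cheaper.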

The feedback loop binds them. The system does not merely plan over simulated experience but learns from it: the climber who fell grips differently the next time, not because someone told her to, but because the surprise reshaped her hands.

5. Crossing to physical space

The visual cortex evolved under the narrow pressure of predator detection and spatial navigation; the simulation machinery that learned to predict “is that shadow a leopard or a rock” turned out to be general enough to support reading, art and face recognition, none of which existed as selection pressures.7 The organ that evolved to keep primates alive in the canopy became the organ that renders everything they have ever experienced. Evolution did not need to cross into a new medium because the original training loop, sensory prediction in physical space, was rich enough.

Whether artificial simulators can stay purely digital depends on the problem. For tasks where the relevant physics can be captured in video, a digital loop may suffice: an architect simulating how light moves through a building, a filmmaker previewing a shot that has not been filmed. For robotics, where the model’s predicted force must match what the actuator delivers, the loop may need to cross into physical space. You feel the steering go light, and your hands correct before the skid registers consciously; that correction depends on proprioceptive signals, not pixels. A simulator that has never touched the gap between commanded torque and actual torque will hallucinate what contact feels like.

I am not sure where the crossover sits, and it may shift as sim-to-real transfer narrows the domain gap. But the implication for world models is concrete: video is the highest-leverage foundation because it captures the broadest slice of physics from existing data, and as the simulator matures it will need to absorb additional modalities – force, proprioception and sound – to close the loop on tasks where pixels alone are not enough.

6. The product loop

You cannot build a world model in isolation. A simulator trained only in the lab optimizes for benchmarks the lab designed, and benchmarks are a map, not the territory. What matters to simulate, at what resolution, in what physical regime, at what timescale, depends on what someone is trying to do with the simulation. An architect needs accurate light transport through a building that does not exist yet; a prosthetics engineer needs a simulated hand that matches the weight and slip of real objects; a game designer needs plausible physics at real-time frame rates across an open world. You discover the tradeoffs only when real users push the simulator past what the research team anticipated.

As the CTO at Runway, my bet on where this goes is not disinterested. Runway works across filmmaking, robotics, gaming and real-time avatars because each domain pushes the simulator in a different direction – coherent light over long shots, accurate contact dynamics, real-time physics, expressive motion – and a world model that handles all of them is one that has actually learned how the world works.

You are still holding the coffee. You have been simulating its weight, its temperature and the flex of the cup in your grip, without noticing, since the first sentence of this essay. Everything you know about the physical world you learned this way: simulate the next half-second, get it wrong and let the surprise reshape the model until it stops getting it wrong. Build the simulator that gets the next half-second right, and the rest follows.

1

The predictive processing framework: Wolpert, D.M., Ghahramani, Z. & Jordan, M.I., “An internal model for sensorimotor integration,” Science 269 (1995): 1880-1882. Bazzi, S. et al., “Simplified internal models in human control of complex objects,” PLOS Computational Biology 20 (2024), confirmed humans use simplified forward models for real-time motor correction. Clark, A., “Whatever next? Predictive brains, situated agents, and the future of cognitive science,” Behavioral and Brain Sciences 36 (2013): 181-204, is the most comprehensive introduction. Friston, K., “The free-energy principle: a unified brain theory?” Nature Reviews Neuroscience 11 (2010): 127-138, provides the mathematical formalization. Rao, R.P.N. & Ballard, D.H., “Predictive coding in the visual cortex,” Nature Neuroscience 2 (1999): 79-87. Keller et al., “Higher-level spatial prediction in natural vision across mouse visual cortex,” PLOS Computational Biology (2026), confirms cortical responses are shaped by sensory predictability, though Gillon, C.J. et al., “Predictive coding: a more cognitive process than we thought?” Trends in Cognitive Sciences (2025), finds genuine prediction errors emerging primarily in prefrontal cortex, complicating the picture. Seth, A.K., Being You: A New Science of Consciousness (Dutton, 2021), calls perception a “controlled hallucination.” Parr, T., Pezzulo, G. & Friston, K.J., Active Inference (MIT Press, 2022), formalizes intelligence as hierarchical belief updating driven by prediction error.

2

The tight feedback loop in coding is literal: an LLM generates code, the interpreter runs it, the error message feeds back into the next generation. Mathematics has a similar loop through formal proof verification. Both are closed-world symbolic domains where the ground truth is computable.

3

The RL world-model lineage had action conditioning but narrow training data: Schmidhuber, “Making the World Differentiable,” TU Munich (1990); Ha & Schmidhuber, “World Models,” NeurIPS (2018); Hafner et al., “Mastering Diverse Domains through World Models,” Nature (2025). The video-prediction lineage had visual diversity but no interactivity: Nair et al., “R3M,” CoRL (2022); Baker et al., “VPT,” NeurIPS (2022). The convergence brought both together. LeCun, Y., “A Path Towards Autonomous Machine Intelligence,” OpenReview (2022). Assran, M. et al., “V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning,” (2025).

4

Hafner, D. et al., “Mastering Diverse Domains through World Models,” Nature (2025) (DreamerV3). Wu, P. et al., “DayDreamer: World Models for Physical Robot Learning,” CoRL (2022).

6

Runway Robotics, “Accelerating Robot Policy Evaluation with General World Models,” (2026). Pearson correlation of 0.95 across eight VLA policies evaluated on the RoboArena benchmark, surpassing PolaRiS (0.98 on matched subset but requiring 3D Gaussian splatting reconstruction).

7

A parallel case: the hippocampus evolved to simulate routes through physical space, and the same circuitry now supports episodic memory, counterfactual reasoning and imagination of futures that have never occurred.