World Models: From Prediction to Planning, HWM and the Challenge of Long-Horizon Control

Featured Columnist

2026-04-17 10:20


On April 3rd, a team from NYU and Meta FAIR published a paper titled "Hierarchical Planning with Latent World Models" (HWM; https://arxiv.org/abs/2604.03208). Rather than generating more realistic future frames, the paper targets a long-standing execution challenge for world models: once the task chain is extended, prediction errors accumulate continuously and the action search space expands rapidly.

AI Summary


Core Insight: The research focus of world models is shifting from enhancing internal predictive capabilities to building executable system capabilities that integrate prediction, planning, and verification, aiming to solve the problems of error accumulation and planning complexity in long-horizon, multi-stage tasks.
Key Elements:
1. V-JEPA 2, pre-trained on over a million hours of video, demonstrates the potential of world models in representation learning and basic prediction, providing a foundation for subsequent planning.
2. By introducing a hierarchical planning structure, HWM decomposes long tasks into high-level phase paths and low-level local actions, increasing success rates in real-world grasping tasks from 0% to 70%.
3. Hierarchical planning not only improves task success rates but also reduces planning computational costs in some scenarios to approximately one-quarter of the original.
4. The WAV model focuses on a model's ability to identify and correct its own prediction distortions, representing a development direction for system verification capabilities.
5. Research trends indicate that world models are evolving from merely predicting the future towards integrated system capabilities encompassing prediction, planning, and verification, to address the challenges of long-chain, multi-stage tasks.

Introduction

Over the past year, world model research has centered on representation learning and future prediction: models first learn to understand the world, then internally simulate future states. This path has already yielded a number of representative results. V-JEPA 2 (Video Joint Embedding Predictive Architecture 2, a video world model suite released by Meta in 2025), pre-trained on over 1 million hours of internet video and then combined with a small amount of robot interaction data, demonstrates the potential of world models in understanding, prediction, and zero-shot robot planning.

However, a model's ability to predict does not equate to its ability to handle long-horizon tasks. When faced with multi-stage control, systems typically encounter two pressures. One is that prediction errors continuously accumulate during long rollouts (sequential multi-step simulation), causing the entire path to increasingly deviate from the target. The other is that the action search space expands rapidly as the horizon (planning distance) grows, leading to continuously rising planning costs. HWM does not rewrite the underlying learning path of world models; instead, it adds a hierarchical planning structure on top of existing action-conditioned world models, allowing the system to first organize stage paths and then handle local actions.

From a technical perspective, V-JEPA 2 (https://ai.meta.com/research/vjepa/) leans more towards world representation and fundamental prediction, HWM leans more towards long-horizon planning, and WAV (World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry, https://arxiv.org/abs/2604.01985) leans more towards a model's ability to identify and correct its own prediction distortions. These three lines are gradually converging. The focus of world model research has shifted from merely predicting the future to how to transform predictive capabilities into executable, correctable, and verifiable system capabilities.

1. Why Long-Horizon Control Remains a Bottleneck for World Models

The difficulty of long-horizon control becomes clearer in the context of robot tasks. Taking robotic arm manipulation as an example, picking up a cup and placing it in a drawer is not a single action but a sequence of steps. The system must approach the object, adjust its posture, complete the grasp, move to the target location, and then handle the drawer and placement. Once the chain becomes long, two problems appear simultaneously: prediction errors accumulate along the rollout, and the action search space expands rapidly.
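Both pressures can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a one-dimensional world in which the model overestimates every step by a small fixed bias (the hypothetical `step_bias` parameter), and a discrete action set searched exhaustively.

```python
def rollout_error(horizon, step_bias=0.01):
    """Gap between the true and the modelled position after `horizon` steps.

    Toy assumption: the true system moves 1.0 per step, while the model
    believes each step moves 1.0 + step_bias.
    """
    true_pos = 0.0
    model_pos = 0.0
    for _ in range(horizon):
        true_pos += 1.0
        model_pos += 1.0 + step_bias
    return abs(model_pos - true_pos)


def num_action_sequences(num_actions, horizon):
    """Number of distinct action sequences an exhaustive planner must consider."""
    return num_actions ** horizon


if __name__ == "__main__":
    for h in (5, 20, 50):
        print(h, rollout_error(h), num_action_sequences(4, h))
```

Even in this simplified setting the error grows with every additional step, and the number of candidate sequences grows exponentially with the horizon (4 actions over 10 steps already give 1,048,576 sequences).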

What the system typically lacks is not local prediction ability, but the ability to organize distant goals into stage paths. Many actions, when viewed locally, may seem to deviate from the goal, but are actually intermediate steps required to complete it. For example, raising the arm before grasping, or moving back slightly and adjusting the angle before opening a drawer.

In demonstration tasks, world models are already capable of providing coherent predictions. In real control scenarios, however, performance declines. The pressure comes not only from the representation itself but also from a planning layer that is not yet mature enough.

2. How HWM Restructures the Planning Process

HWM splits the originally single-layer planning process into two layers. The upper layer is responsible for stage direction on a longer timescale, while the lower layer is responsible for local execution on a shorter timescale. The model plans not at a single rhythm but simultaneously at two different temporal rhythms.

When handling long tasks, single-layer methods typically need to directly search the entire action chain within the underlying action space. The longer the task, the higher the search cost, and the more easily prediction errors diffuse along multi-step rollouts. After HWM splits the process, the high-level layer only handles route selection on a longer timescale, and the low-level layer only handles the completion of actions in the current segment. The entire long task is broken into multiple shorter tasks, reducing planning complexity.
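The effect of this split on search cost can be illustrated with a small sketch. It is a toy under stated assumptions, not the paper's planner: the world is a one-dimensional line, the "world model" is exact addition, both levels use exhaustive search, and the names `LOW_ACTIONS`, `HIGH_ACTIONS`, `flat_plan`, and `hierarchical_plan` are hypothetical.

```python
from itertools import product

LOW_ACTIONS = (-1, 0, 1)   # primitive moves on a 1-D line (toy assumption)
HIGH_ACTIONS = (-2, 0, 2)  # coarse stage moves, each spanning two primitive steps


def rollout(state, actions):
    """Toy world model: apply each action in turn and return the final state."""
    for a in actions:
        state += a
    return state


def search(start, goal, action_set, horizon):
    """Score every action sequence; return (best_sequence, sequences_evaluated)."""
    best, best_cost, evaluated = None, float("inf"), 0
    for seq in product(action_set, repeat=horizon):
        evaluated += 1
        cost = abs(rollout(start, seq) - goal)
        if cost < best_cost:
            best, best_cost = seq, cost
    return best, evaluated


def flat_plan(start, goal, horizon):
    """Single-layer planning: search the whole primitive action chain at once."""
    return search(start, goal, LOW_ACTIONS, horizon)


def hierarchical_plan(start, goal, high_horizon, low_horizon):
    """Two-layer planning: pick coarse stages first, then solve each stage locally."""
    high_seq, total = search(start, goal, HIGH_ACTIONS, high_horizon)
    plan, state = [], start
    for macro in high_seq:
        subgoal = state + macro
        low_seq, n = search(state, subgoal, LOW_ACTIONS, low_horizon)
        total += n
        plan.extend(low_seq)
        state = rollout(state, low_seq)
    return tuple(plan), total


if __name__ == "__main__":
    print(flat_plan(0, 6, 6))             # searches 3**6 = 729 sequences
    print(hierarchical_plan(0, 6, 3, 2))  # searches 3**3 + 3 * 3**2 = 54 sequences
```

Both planners reach the goal with the same primitive actions, but the two-layer version evaluates 54 candidate sequences instead of 729. The exact ratio is an artifact of this toy and should not be read as a reproduction of the paper's reported savings.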

There is also a key design here: high-level actions are not simply the difference between two states; instead, an encoder is used to compress a segment of low-level actions into a higher-level action representation. For long tasks, the key is not just the difference between the start and end points, but also how the intermediate steps are organized. If the high-level layer only looks at displacement differences, it easily loses the path information within that action chain.
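A small sketch shows why displacement alone can be too lossy. HWM uses a learned encoder; the `encode_segment` function below is only a hand-written stand-in with hypothetical features (path length and peak height), chosen to show that two action segments with the same start-to-end difference can still describe very different paths.

```python
def displacement(segment):
    """Net (dx, dy) of a segment of 2-D low-level actions."""
    return (sum(dx for dx, _ in segment), sum(dy for _, dy in segment))


def encode_segment(segment):
    """Hand-written stand-in for a learned segment encoder (illustrative only).

    Keeps the net displacement and adds two path features:
    total distance travelled and the peak height reached along the way.
    """
    dx, dy = displacement(segment)
    path_length = sum(abs(ax) + abs(ay) for ax, ay in segment)
    height, peak = 0, 0
    for _, ay in segment:
        height += ay
        peak = max(peak, height)
    return (dx, dy, path_length, peak)


# Two four-step segments that both end two units to the right.
straight = [(1, 0), (1, 0), (0, 0), (0, 0)]    # slide directly across
lift_over = [(0, 1), (1, 0), (1, 0), (0, -1)]  # raise, move across, lower

if __name__ == "__main__":
    print(displacement(straight), displacement(lift_over))
    print(encode_segment(straight), encode_segment(lift_over))
```

The two segments are indistinguishable by displacement, yet only the second one clears an obstacle in between. A representation built from the action segment itself can preserve that difference.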

HWM embodies a hierarchical task organization approach. Faced with a multi-stage job, the system no longer unfolds all actions at once but first forms a coarser stage path, then executes and corrects segment by segment. Once this hierarchical relationship is incorporated into the world model, predictive capabilities begin to transform more stably into planning capabilities.
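The benefit of executing and correcting segment by segment can be sketched with a toy example. Everything here is an assumption for illustration rather than the paper's setup: a one-dimensional world, fixed stage subgoals, a greedy low-level planner, and a hypothetical `execute` function in which one step silently fails.

```python
def execute(state, action, step, slip_at=1):
    """Toy environment: a move takes effect except at step `slip_at`, where it is lost."""
    return state if step == slip_at else state + action


def plan_segment(state, subgoal, horizon):
    """Greedy low-level plan: move toward the subgoal by at most one unit per step."""
    plan = []
    for _ in range(horizon):
        move = max(-1, min(1, subgoal - state))
        plan.append(move)
        state += move
    return plan


def open_loop(start, subgoals, horizon):
    """Plan every segment from predicted states, then execute without looking."""
    plan, predicted = [], start
    for sg in subgoals:
        seg = plan_segment(predicted, sg, horizon)
        plan += seg
        predicted += sum(seg)
    state = start
    for step, action in enumerate(plan):
        state = execute(state, action, step)
    return state


def segment_by_segment(start, subgoals, horizon):
    """Replan each segment from the state actually reached after the previous one."""
    state, step = start, 0
    for sg in subgoals:
        for action in plan_segment(state, sg, horizon):
            state = execute(state, action, step)
            step += 1
    return state


if __name__ == "__main__":
    stages = [2, 4, 6]
    print(open_loop(0, stages, 3))           # ends at 5: the lost step is never recovered
    print(segment_by_segment(0, stages, 3))  # ends at 6: the next segment absorbs the error
```

With one spare step per segment, replanning at each stage boundary recovers from the lost move, while the plan computed once in advance carries the error to the end.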

3. From 0% to 70%: What the Experimental Results Indicate

In the real-world grasp-and-place tasks set up in the paper, the system only receives the final goal condition, with no manually decomposed intermediate goals provided. Under these conditions, HWM achieved a success rate of 70%, while the single-layer world model had a 0% success rate. A long task that the single-layer planner could not complete at all became one that succeeds most of the time once hierarchical planning was introduced.

The paper also tested simulation tasks such as object pushing and maze navigation. The results show that hierarchical planning not only improved success rates but also reduced computational costs during the planning phase. In some environments, planning phase computational costs could be reduced to about a quarter of the original while maintaining higher or comparable success rates.

4. From V-JEPA 2 to HWM to WAV

V-JEPA 2 represents the world representation path. V-JEPA 2 was pre-trained on over 1 million hours of internet video and then combined with less than 62 hours of robot video for post-training (targeted training after pre-training), resulting in a latent action-conditioned world model (a world model that predicts in an abstract representation space incorporating action information) usable for understanding, predicting, and planning in the physical world. It demonstrates that models can acquire world representations through large-scale observation and transfer this representation to robot planning.

HWM is the next step. The model already possesses world representation and basic prediction capabilities, but once it enters multi-stage control, the problems of error accumulation and search space expansion come to the fore. HWM does not change the underlying representation learning path; instead, it adds a multi-timescale planning structure on top of existing action-conditioned world models. It addresses the problem of how the model organizes a distant goal into a set of intermediate steps and then advances segment by segment.

WAV further shifts the focus to verification capabilities. For world models to enter policy optimization and deployment scenarios, they cannot just predict; they must also be able to identify areas where they are prone to distortion and correct accordingly. WAV focuses on how the model checks itself.

V-JEPA 2 leans towards world representation, HWM leans towards task planning, and WAV leans towards result verification. Although the three have different focuses, their general direction is consistent. The next phase for world models is no longer just internal prediction, but the gradual integration of prediction, planning, and verification into a cohesive system capability.

5. From Internal Prediction to Executable Systems

Many past world model efforts were closer to improving the continuity of future state predictions or enhancing the stability of internal world representations. However, the current research focus has begun to change. Systems must now both form judgments about the environment and translate those judgments into actions, continuing to correct the next steps after results emerge. To get closer to real-world deployment, it is necessary to control error propagation, compress search scope, and reduce inference costs in long-horizon tasks.

Such changes will also affect AI agents. Many agent systems can already complete short-chain tasks, such as calling tools, reading files, and executing multi-step instructions. But once tasks become long-chain, multi-stage, and require mid-course replanning, performance declines. This is not fundamentally different from the difficulties in robot control; both stem from insufficient high-level path organization capability, leading to a disconnect between local execution and overall goals.

HWM offers a hierarchical approach in which the high level is responsible for paths and stage goals and the low level is responsible for local actions and feedback processing, combined with result verification. This type of hierarchical structure will continue to appear in more systems in the future. The next phase for world models will also no longer focus solely on predicting the future, but on organizing prediction, execution, and correction into a path that can be run.
