Highlights
Comparison of video world models for autonomous driving. (a) Conventional video world models typically serve as data engines for simulation in pixel space, with world modeling decoupled from planning as two separate processes. (b) Unified world models perform video generation and planning as separate tasks within the same architecture, but without explicit synergy between the two components. (c) Our proposed Policy World Model plans based on learned world knowledge, enabling collaborative state-action prediction that mimics human-like anticipatory perception.
Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: most world models are learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, how world modeling can synergistically facilitate planning remains underexplored. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but also benefits planning by exploiting the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM mimics human-like anticipatory perception, yielding more reliable planning performance. To improve the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front-camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.
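As a rough illustration of the training objective, the sketch below shows how a focal-style loss can adaptively emphasize hard tokens during parallel token prediction. The paper's exact adaptive dynamic focal loss is not specified here, so the dynamic_weight term and the function name are assumptions for illustration only.

import torch
import torch.nn.functional as F

def focal_token_loss(logits, targets, gamma=2.0, dynamic_weight=None):
    """Focal-style cross-entropy over visual tokens (sketch, not the paper's exact loss).

    logits:         (B, T, V) logits over the token vocabulary, T tokens per frame
    targets:        (B, T)    ground-truth token ids
    dynamic_weight: optional (B, T) per-token weights, e.g. larger in regions
                    that change between frames (assumed form)
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
    p_t = torch.exp(-ce)                 # model probability of the true token
    loss = (1.0 - p_t) ** gamma * ce     # focal term down-weights easy tokens
    if dynamic_weight is not None:
        loss = loss * dynamic_weight.flatten()
    return loss.mean()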
Policy World Model for Unified Forecasting and Planning. Policy World Model (PWM) introduces a unified framework that integrates world modeling and trajectory planning. Unlike existing approaches that decouple world simulation from decision-making, PWM leverages learned world knowledge to enhance planning through collaborative state-action prediction. (a) PWM leverages its pre-trained world modeling to generate future frames, enabling seamless collaboration between perception, prediction, and planning. The framework performs action-free future forecasting by first generating a textual description of the current environment, then forecasting future video frames based on learned world knowledge, and finally predicting optimal actions by considering both current observations and anticipated future states. (b) Future video frames are compressed into compact latent representations guided by the initial frame, enabling efficient parallel token generation while preserving the high-quality visual information needed for planning decisions. This action-free forecasting scheme yields more reliable planning while maintaining training scalability.
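To make the three-stage flow concrete, here is a minimal, hypothetical sketch of the inference loop; the module names (describer, forecaster, action_head) are illustrative stand-ins, not the paper's actual API.

def plan_with_anticipation(frame, describer, forecaster, action_head, horizon=4):
    """Action-free forecasting followed by planning (illustrative sketch).

    1. Generate a textual description of the current environment.
    2. Forecast future frames from learned world knowledge, without actions.
    3. Predict actions from both current and anticipated future states.
    """
    caption = describer(frame)                      # stage 1: scene description
    future = [frame]
    for _ in range(horizon):                        # stage 2: next-frame forecasting
        future.append(forecaster(future[-1], caption))
    return action_head(frame, future[1:])           # stage 3: trajectory planning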
Key Features:
Unified Architecture • Human-like Anticipatory Perception • Efficient Video Forecasting with Dynamic Parallel Token Generation
Pipeline for video world modeling. (a) World modeling is conducted on action-free, highly compressed video data using dynamically enhanced parallel prediction. A context-guided tokenizer compresses each image into only 28 tokens, allowing all tokens within a frame to be generated in parallel. Video synthesis thus proceeds by next-frame prediction rather than next-token prediction, significantly accelerating forecasting. (b) Comparison of token prediction formats and attention interactions, contrasting our parallel scheme with traditional sequential next-token prediction in terms of efficiency and temporal modeling.
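The decoding step can be pictured as follows: a single forward pass emits logits for all 28 tokens of the next frame, replacing 28 sequential next-token steps. The model interface below is an assumed one for illustration, not the paper's actual implementation.

import torch

TOKENS_PER_FRAME = 28  # per the context-guided tokenizer described above

@torch.no_grad()
def generate_next_frame(model, token_history):
    """token_history: (B, F * TOKENS_PER_FRAME) token ids of past frames.
    Returns (B, TOKENS_PER_FRAME) ids for the next frame in one forward pass
    (assumed interface: the model outputs one logit slot per future token)."""
    logits = model(token_history)        # (B, TOKENS_PER_FRAME, vocab_size)
    return logits.argmax(dim=-1)         # greedy decoding; sampling also works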
Performance Comparison: Our Policy World Model achieves state-of-the-art performance on benchmark datasets. Despite using only front camera input, our method matches or exceeds approaches that rely on multi-view and multi-modal inputs. The results demonstrate the effectiveness of our collaborative state-action prediction framework.
Coming Soon: Autonomous Driving Demonstrations
Video demonstrations of Policy World Model will be added here, showcasing collaborative state-action prediction and action-free future forecasting across driving scenarios, including world modeling, future state forecasting, and trajectory planning.
@inproceedings{zhao2025pwm,
title={From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction},
author={Zhao, Zhida and Fu, Talas and Wang, Yifan and Wang, Lijun and Lu, Huchuan},
booktitle={Advances in Neural Information Processing Systems},
year={2025}
}