From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction

We introduce Policy World Model (PWM), a unified framework that integrates world modeling and trajectory planning in a single architecture.

Our method leverages action-free future forecasting to mimic human-like anticipatory perception, enhancing planning performance without requiring action-labeled data.

We develop dynamic parallel token generation with context-guided compression, achieving efficient video forecasting while maintaining high-quality future state prediction.

Despite using only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.

Comparison of video world models for autonomous driving. (a) Conventional video world models typically serve as data engines for simulation in pixel space, operating in a decoupled manner where world modeling and planning are separate processes. (b) Unified world models perform video generation and planning as separate tasks within the same architecture, but without explicit synergy between the two components. (c) Our proposed Policy World Model performs planning based on the learned world knowledge, enabling collaborative state-action prediction that mimics human-like anticipatory perception.

Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: world models are mostly learned for world simulation and are decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, how world modeling can actively benefit planning remains underexplored. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but also uses the learned world knowledge to benefit planning through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM mimics human-like anticipatory perception, yielding more reliable planning performance. To make video forecasting efficient, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite using only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.

Policy World Model for Unified Forecasting and Planning. Policy World Model (PWM) introduces a unified framework that integrates world modeling and trajectory planning. Unlike existing approaches that decouple world simulation from decision-making, PWM leverages learned world knowledge to enhance planning through collaborative state-action prediction.

(a) PWM uses its pre-trained world modeling capability to generate future frames, linking perception, prediction, and planning. The framework performs action-free future forecasting in three steps: it first generates a textual description of the current environment, then forecasts future video frames from learned world knowledge, and finally predicts actions from both the current observation and the anticipated future states.

(b) Future video frames are compressed into compact latent representations guided by the initial frame, enabling efficient parallel token generation while maintaining high-quality visual information for planning decisions. This action-free forecasting scheme enables more reliable planning performance while maintaining training scalability.
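The three-step flow described in (a) can be sketched as plain function composition. This is only an illustrative sketch, not the PWM implementation: `collaborative_predict`, `Prediction`, and the `toy_*` stand-ins are hypothetical names, and the real components are learned models rather than hand-written functions. The point it shows is that the forecasting step receives no action input, and that the planner sees both the current frame and the forecast frames.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[float]             # stand-in for an image or its tokens (assumption)
Waypoint = Tuple[float, float]  # (x, y) position of a planned point (assumption)


@dataclass
class Prediction:
    description: str
    future_frames: List[Frame]
    trajectory: List[Waypoint]


def collaborative_predict(
    frame: Frame,
    describe: Callable[[Frame], str],
    forecast: Callable[[Frame, str, int], List[Frame]],
    plan: Callable[[Frame, str, List[Frame]], List[Waypoint]],
    horizon: int = 3,
) -> Prediction:
    """Describe the scene, forecast future frames, then plan.

    Note that `forecast` takes no action argument: the future is
    predicted from the observation and its description alone.
    """
    description = describe(frame)
    future_frames = forecast(frame, description, horizon)
    trajectory = plan(frame, description, future_frames)
    return Prediction(description, future_frames, trajectory)


# Toy stand-ins so the sketch runs; all three are hypothetical.
def toy_describe(frame: Frame) -> str:
    return f"scene with mean intensity {sum(frame) / len(frame):.2f}"


def toy_forecast(frame: Frame, description: str, horizon: int) -> List[Frame]:
    # Pretend the scene brightens slightly at every future step.
    return [[v + 0.1 * (t + 1) for v in frame] for t in range(horizon)]


def toy_plan(frame: Frame, description: str, future: List[Frame]) -> List[Waypoint]:
    # One waypoint per forecast frame, moving straight ahead.
    return [(float(t + 1), 0.0) for t in range(len(future))]


result = collaborative_predict([0.0, 1.0], toy_describe, toy_forecast, toy_plan)
print(result.description)
print(len(result.future_frames), result.trajectory)
```

Passing the stages in as callables keeps the sketch independent of any particular model; in a unified architecture such as the one described here, all three roles would presumably be played by the same network.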

Key Features: Unified Architecture • Human-like Anticipatory Perception • Efficient Video Forecasting with Dynamic Parallel Token Generation

Pipeline for video world modeling. (a) World modeling is conducted on action-free, highly compressed video data using dynamically enhanced parallel prediction. Our system employs a context-guided tokenizer that compresses each image into only 28 tokens, enabling parallel generation of all tokens within a single frame. This approach allows video synthesis through next-frame prediction rather than next-token prediction, significantly accelerating the forecasting process. (b) Comparison of token prediction formats and attention interactions, showing how our method differs from traditional sequential token prediction approaches in terms of efficiency and temporal modeling capabilities.
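A rough way to see why next-frame prediction is faster than next-token prediction is to count decoding steps. The sketch below uses the 28 tokens per frame stated above; the function names are hypothetical. It also builds a frame-level block mask in which every token may attend to all tokens of its own frame and of earlier frames. That mask is an assumption chosen for illustration, since the text does not spell out the exact attention pattern PWM uses.

```python
from typing import List

TOKENS_PER_FRAME = 28  # per-frame token count stated in the text


def decoding_steps(
    num_frames: int,
    tokens_per_frame: int = TOKENS_PER_FRAME,
    parallel: bool = True,
) -> int:
    """Number of generation steps needed to produce `num_frames` frames.

    parallel=True  : one step emits all tokens of a frame (next-frame prediction).
    parallel=False : one step emits a single token (next-token prediction).
    """
    if num_frames < 0 or tokens_per_frame <= 0:
        raise ValueError("num_frames must be >= 0 and tokens_per_frame > 0")
    return num_frames if parallel else num_frames * tokens_per_frame


def frame_block_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Assumed block mask: mask[q][k] is True when query token q may attend
    to key token k, i.e. when k's frame is not later than q's frame."""
    total = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(total)]
        for q in range(total)
    ]


sequential = decoding_steps(8, parallel=False)
parallel = decoding_steps(8, parallel=True)
print(f"8 frames: {sequential} sequential steps vs {parallel} parallel steps")

# Small mask for 2 frames of 2 tokens each, printed as 0/1 rows.
for row in frame_block_mask(2, 2):
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions the step count drops by a factor equal to the number of tokens per frame, which is why a very compact tokenizer and parallel generation go together.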

Performance Comparison: Our Policy World Model achieves state-of-the-art performance on benchmark datasets. Despite using only front camera input, our method matches or exceeds approaches that rely on multi-view and multi-modal inputs. The results demonstrate the effectiveness of our collaborative state-action prediction framework.

Video demonstrations will be added here to showcase Policy World Model's performance in various driving scenarios, including world modeling, future state forecasting, and trajectory planning capabilities.

BibTeX

@inproceedings{zhao2025pwm,
  title={From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction},
  author={Zhao, Zhida and Fu, Talas and Wang, Yifan and Wang, Lijun and Lu, Huchuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}

NeurIPS 2025 Poster