AHA-WAM

Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Jisong Cai1,2* Long Ling1,3* Shiwei Chu1 Zhongshan Liu3 Jiayue Kang1 Zhixuan Liang4,2 Wenjie Xu3 Yinan Mao3 Weinan Zhang1,2 Xiaokang Yang1 Ru Ying3 Ran Zheng3 Yao Mu1,2†
1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3Baidu AI Cloud 4The University of Hong Kong

Abstract

Asynchronous Horizon-Adaptive World-Action Modeling

World-action models inject physical priors into robot policies by learning visual scene dynamics together with actions. Existing formulations, however, bind world prediction and action execution to the same short temporal rhythm. AHA-WAM reorganizes this interface around temporal asymmetry: a low-frequency video DiT acts as a long-horizon world planner with rolling K/V memory, while a high-frequency action DiT executes short closed-loop action chunks by querying reusable layerwise planner context. Horizon-adaptive offset training makes the executor robust to planner-executor phase shifts, and Observation-Guided Video-Context Routing adapts cached planner context to the latest visual observation without rerunning the video DiT. Across RoboTwin 2.0 and real robots, AHA-WAM reaches 92.80% average RoboTwin success, 78.33% original-setting real-world success, and 24.17 Hz closed-loop control; the distilled AHA-WAM-Flash variant further reaches 56.95 Hz.

Introduction

Overview

AHA-WAM separates what should be slow from what must be fast. The video branch forms a reusable long-horizon latent world plan; the action branch repeatedly adapts that plan to the current observation and produces executable action chunks in a high-frequency control loop.

AHA-WAM overview with RoboTwin, real-world and latency results
RoboTwin 2.0 92.80%

Average success across 50 dual-arm tasks, without robot-data pretraining.

Real-World Tasks 78.33%

Original-setting success across four physical bimanual tasks.

AHA-WAM-Flash 56.95 Hz

ODE-distilled closed-loop frequency with a 10.82x speedup over Fast-WAM.

Method

Asynchronous Horizon-Adaptive World-Action Modeling

AHA-WAM reorganizes world-action modeling around temporal asymmetry. The method first separates long-horizon world planning from high-frequency action execution, then keeps the reused planner context aligned with the latest observation and stable across asynchronous planner-executor phases.

AHA-WAM dual-DiT architecture and attention mask
The full AHA-WAM pipeline: slow video-DiT world planning, observation-guided context routing, and fast action-DiT execution.

Planner-Executor Architecture

Slow world planning, fast closed-loop execution.

AHA-WAM keeps a dual-DiT world-action model but assigns each branch a different temporal role. The video DiT predicts long-horizon visual latents and exposes layerwise planner context; the action DiT receives proprioception directly and consumes routed planner context through layerwise joint attention to denoise short executable chunks.
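
The layerwise joint attention described above can be illustrated with a minimal NumPy sketch. It assumes a single attention head and assumes that action queries attend over the concatenation of the action branch's own keys/values and the planner's keys/values at the same layer. The function name `joint_attention`, the token counts, and the feature width are hypothetical, and the attention mask and learned projections of the actual model are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q_act, k_act, v_act, k_plan, v_plan):
    """Single-head attention: action tokens attend to action + planner tokens."""
    k = np.concatenate([k_act, k_plan], axis=0)        # (n_act + n_plan, d)
    v = np.concatenate([v_act, v_plan], axis=0)        # (n_act + n_plan, d)
    scores = q_act @ k.T / np.sqrt(q_act.shape[-1])    # (n_act, n_act + n_plan)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical sizes: 4 action tokens, 10 planner tokens, width 16.
rng = np.random.default_rng(0)
d, n_act, n_plan = 16, 4, 10
q_act, k_act, v_act = (rng.standard_normal((n_act, d)) for _ in range(3))
k_plan, v_plan = (rng.standard_normal((n_plan, d)) for _ in range(2))

out, weights = joint_attention(q_act, k_act, v_act, k_plan, v_plan)
```

In this sketch only the action branch produces queries, so the planner keys/values can be computed once and reused by later action updates.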

Layerwise Coupling

Planner context is reused without decoding future frames.

One video-DiT refresh exposes latent K/V context that can be amortized over multiple action-DiT updates. The action branch therefore benefits from learned visual dynamics while avoiding expensive future-frame rollout in the per-update control loop.
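
A back-of-the-envelope cost model makes the amortization argument concrete. The sketch below assumes a planner refresh and an action update run sequentially on one device, so one refresh is shared by a fixed number of action updates. The function name and all latency values are hypothetical illustrations, not measurements from the paper.

```python
def amortized_cost_ms(planner_ms, action_ms, updates_per_refresh):
    """Average cost per action update when one planner refresh is shared."""
    if updates_per_refresh < 1:
        raise ValueError("updates_per_refresh must be at least 1")
    return planner_ms / updates_per_refresh + action_ms

# Hypothetical latencies: 200 ms per planner refresh, 20 ms per action update.
synchronous = amortized_cost_ms(200.0, 20.0, updates_per_refresh=1)  # 220.0
amortized = amortized_cost_ms(200.0, 20.0, updates_per_refresh=8)    # 45.0
```

If the planner additionally runs concurrently with the executor, its cost can overlap with action updates rather than being added to them, which this sequential model does not capture.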

Observation-Guided Video-Context Routing

Cached planner context is adapted before every action chunk.

Asynchronous execution reuses one planner context across multiple action updates, which can make the context stale. OVCR builds compact routing queries from the latest visual observation, reads the planner K/V context, and applies gated residual updates so each action chunk sees a planner representation aligned with current visual evidence.
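
The routing step can be sketched as a small cross-attention read followed by a gated residual. This is a simplified illustration under stated assumptions: routing queries are a linear projection of observation tokens, attention is single-head, and the gate is one scalar passed through a sigmoid. The names `route_context`, `w_query`, and `gate_logit` are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_context(obs_tokens, k_plan, v_plan, w_query, gate_logit):
    """Read cached planner K/V with observation-derived queries (assumed form)."""
    queries = obs_tokens @ w_query                       # (m, d) routing queries
    scores = queries @ k_plan.T / np.sqrt(queries.shape[-1])
    read = softmax(scores, axis=-1) @ v_plan             # (m, d) planner readout
    gate = 1.0 / (1.0 + np.exp(-gate_logit))             # scalar in (0, 1)
    return queries + gate * read                         # gated residual update

# Hypothetical sizes: 6 observation tokens of width 32, 10 planner tokens of width 16.
rng = np.random.default_rng(0)
obs_tokens = rng.standard_normal((6, 32))
w_query = rng.standard_normal((32, 16)) / np.sqrt(32)
k_plan = rng.standard_normal((10, 16))
v_plan = rng.standard_normal((10, 16))

routed = route_context(obs_tokens, k_plan, v_plan, w_query, gate_logit=0.0)
```

Because only this lightweight read runs per action chunk, the cached planner K/V is left untouched and the video DiT is not rerun. A strongly negative gate logit drives the update toward the observation queries alone.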

Horizon-adaptive offset training timeline

Horizon-Adaptive Offset Training

Training matches asynchronous deployment phases.

The action grid is randomly shifted inside the video horizon, teaching the executor to consume planner context under the phase offsets created by asynchronous streaming.
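
One way to picture the random shift is as sampling where the action chunk starts inside the planner's horizon. The sketch below assumes the video horizon and the action chunk are counted in the same time steps and that offsets are drawn uniformly. Both are assumptions for illustration, as are the function name and the horizon sizes.

```python
import random

def sample_action_window(video_horizon, chunk_len, rng):
    """Pick a random start offset so the action chunk fits inside the video horizon."""
    if chunk_len > video_horizon:
        raise ValueError("action chunk must fit inside the video horizon")
    offset = rng.randrange(video_horizon - chunk_len + 1)
    return offset, list(range(offset, offset + chunk_len))

# Hypothetical sizes: a 32-step video horizon and 8-step action chunks.
rng = random.Random(0)
windows = [sample_action_window(32, 8, rng) for _ in range(1000)]
```

Each training sample then pairs the same planner context with an action chunk at a different phase, mirroring what the executor encounters when it reuses one planner refresh across several updates.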

Rolling Planner Memory

The planner remembers past visual states.

The low-frequency video planner maintains a fixed-size FIFO rolling K/V memory over recent planner refreshes. This extends the planner's temporal receptive field for long-horizon tasks where completed subgoals or displaced objects may no longer be fully visible.
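
A fixed-size FIFO memory of this kind can be sketched with a bounded deque. The class below is a hypothetical illustration: it stores one keys/values entry per planner refresh and evicts the oldest entry when full. Strings stand in for real K/V tensors, and the capacity of three refreshes is an arbitrary choice.

```python
from collections import deque

class RollingKVMemory:
    """Keeps K/V entries from the most recent planner refreshes (FIFO eviction)."""

    def __init__(self, max_refreshes):
        # deque with maxlen drops the oldest entry automatically when full.
        self._slots = deque(maxlen=max_refreshes)

    def push(self, keys, values):
        self._slots.append((keys, values))

    def context(self):
        # Concatenate stored entries, oldest refresh first.
        keys = [k for ks, _ in self._slots for k in ks]
        values = [v for _, vs in self._slots for v in vs]
        return keys, values

    def __len__(self):
        return len(self._slots)

memory = RollingKVMemory(max_refreshes=3)
for step in range(5):
    memory.push([f"k{step}"], [f"v{step}"])

keys, values = memory.context()  # keys == ["k2", "k3", "k4"]
```

The memory cost stays constant regardless of episode length, while the planner can still attend to the last few refreshes.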

Streaming Inference and Real-Time Optimization

The video DiT leaves the per-update critical path.

At deployment, planner prefill and action execution run as non-blocking streams. TensorRT, CUDA-graph capture, loop-invariant hoisting, and ODE distillation reduce action-update latency while preserving the same planner-context and OVCR interface.
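
The effect of decoupling the two streams can be shown with a small deterministic simulation. It is an idealized sketch, not the deployed scheduler: it assumes a new planner context becomes available every `planner_period` action ticks and that the executor always uses the most recent finished context without waiting. The function and parameter names are hypothetical.

```python
def simulate_streaming(num_updates, planner_period):
    """For each action tick, report (planner refresh used, ticks since that refresh)."""
    if planner_period < 1:
        raise ValueError("planner_period must be at least 1")
    schedule = []
    for tick in range(num_updates):
        refresh_id = tick // planner_period          # latest finished refresh
        staleness = tick - refresh_id * planner_period
        schedule.append((refresh_id, staleness))
    return schedule

schedule = simulate_streaming(num_updates=8, planner_period=4)
# schedule == [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]
```

The staleness column is the planner-executor phase offset: it varies from update to update, which is the condition that offset training and OVCR are described as handling.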

Experiments

Simulation, Ablations, Real-World Deployment, and Frequency

The experiments are organized around five matched questions: simulation capability, component mechanism, real-robot deployability, out-of-distribution robustness, and closed-loop efficiency.

Table 1

RoboTwin 2.0 Average Success on 50 Tasks

Real Robot

Four Bimanual Manipulation Tasks

AHA-WAM is deployed on two AgileX Piper arms using ego-view RGB observations. The four tasks are intentionally complementary: rigid-object placement, deformable manipulation, long-horizon multi-object organization, and fine-grained contact-rich tool use. Each task has original-setting and generalization rollouts.

Store Plate Rollouts

Original and Generalization Videos

Rollouts of the same task are shown for Motus, Fast-WAM, π0.5, and AHA-WAM.

Supplementary Videos

Runtime Comparisons and AHA-WAM Demos

Runtime clips isolate the effect of the asynchronous control loop, while the demo rollouts collect representative AHA-WAM successes across original and shifted task settings.

Runtime and Control Smoothness

Side-view runtime comparisons for Motus are paired with the available AHA-WAM rollout views.

Motus · No RTC / Interpolation
Motus · With RTC + Action Interpolation
AHA-WAM · No action interpolation
AHA-WAM · With action interpolation

AHA-WAM Demo Rollouts

Prepare Soy Milk · Original
Prepare Soy Milk · Background Shift
Fold Cloth · Green Cloth
Fold Cloth · Yellow Cloth
Stack Plates · Original
Store Plate · Top Light
Store Plate · Side Light
Organize Desktop

Citation

BibTeX

@article{cai2026ahawam,
  title={AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing},
  author={Cai, Jisong and Ling, Long and Chu, Shiwei and Liu, Zhongshan and Kang, Jiayue and Liang, Zhixuan and Xu, Wenjie and Mao, Yinan and Zhang, Weinan and Yang, Xiaokang and Ying, Ru and Zheng, Ran and Mu, Yao},
  journal={arXiv preprint arXiv:2606.09811},
  year={2026}
}