End-to-end control, planning, and search via differentiable physics
tl;dr
We embed the simulator's differentiable dynamics into the computation graph of different world models to efficiently learn predictive,
prescriptive, and counterfactual state predictors. Differentiable simulation (DiffSim) lets us solve these inverse and what-if problems without
trial-and-error search.
Differentiable simulators represent an environment's dynamics as a differentiable function. Within robotics and autonomous driving, this property is used in Analytic Policy Gradients (APG), which backpropagates through the dynamics to train accurate policies for diverse tasks. Here we show that differentiable simulation also has an important role in world modeling, where it can impart predictive, prescriptive, and counterfactual capabilities to an agent. Specifically, we design three novel task setups in which the differentiable dynamics are combined within an end-to-end computation graph not with a policy, but with a state predictor. This allows us to learn relative odometry, optimal planners, and optimal inverse states. We collectively call these predictors Analytic World Models (AWMs) and demonstrate how differentiable simulation enables their efficient, end-to-end learning. In autonomous driving scenarios, they have broad applicability and can augment an agent's decision-making beyond reactive control.
Compared to control, where the focus is on predicting actions,
world modeling is concerned with predicting states. This is a rich and nuanced setting that includes next states resulting from an action sequence,
desirable states not conditioned on any actions, and counterfactual states answering "what-if" questions.
All of these require understanding the world's dynamics, which is where differentiable simulation comes in: applied to world modeling,
DiffSim lets gradients backpropagate through the dynamics and into the state predictors.
DiffSim is useful because it embeds the environment's dynamics into the training loop. Consider a setting in which the goal is to
learn an action such that, after execution in the simulator, the resulting loss is minimal. Behavior cloning directly supervises the predicted action
without regard to the dynamics. Even though the model learns a distribution centered on the correct action (red, left), if the dynamics are nonlinear,
some of the sampled action values still incur an undesirably high loss. DiffSim methods instead compute the loss directly in the outcome space, taking the dynamics into account.
As a result, the learned action distribution is tighter and the obtained loss is smaller. The same pattern holds when learning world models, as sketched below.
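To make the distinction concrete, here is a minimal JAX sketch (not the paper's implementation) with a toy nonlinear dynamics function standing in for the simulator step. Behavior cloning measures error in action space, while the DiffSim loss rolls the action through the dynamics and measures error in outcome space, so gradients flow through the simulator step:

```python
import jax
import jax.numpy as jnp

def dynamics(state, action):
    # Toy nonlinear dynamics standing in for the differentiable simulator step.
    return state + jnp.sin(action)

def bc_loss(pred_action, expert_action):
    # Behavior cloning: supervise the action directly, ignoring the dynamics.
    return jnp.mean((pred_action - expert_action) ** 2)

def diffsim_loss(state, pred_action, expert_next_state):
    # Differentiable simulation: execute the action through the dynamics and
    # measure the error in outcome (state) space instead.
    next_state = dynamics(state, pred_action)
    return jnp.mean((next_state - expert_next_state) ** 2)

# Gradients flow through the simulator step and into the predicted action.
state = jnp.array([0.0])
expert_next = jnp.array([0.5])
action_grad = jax.grad(diffsim_loss, argnums=1)(state, jnp.array([0.1]), expert_next)
print(action_grad)
```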
The concept of a world model is nuanced, as there are different ways to understand the effect of one's own actions.
We formulate three task setups related to world modeling.
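As a rough schematic of these three setups, the predictors can be written as follows. This is a toy sketch with linear maps and made-up parameter names; the actual models in the paper are learned networks operating on scene features:

```python
import jax.numpy as jnp

def predictive(params, state, action):
    # Next-state predictor: imagines the state reached after executing `action` in `state`.
    return params["A"] @ state + params["B"] @ action

def prescriptive(params, state):
    # Planner: imagines the next *desirable* state, unconditioned on any action.
    return params["P"] @ state

def counterfactual(params, state, action):
    # Imagines an alternative state in which `action` would have been optimal.
    return params["C"] @ jnp.concatenate([state, action])

params = {"A": jnp.eye(2), "B": jnp.eye(2), "P": jnp.eye(2), "C": jnp.ones((2, 4))}
s, a = jnp.zeros(2), jnp.array([1.0, 0.1])
print(predictive(params, s, a), prescriptive(params, s), counterfactual(params, s, a))
```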
We evaluate in Waymax, on the Waymo Open Motion Dataset, with predictive, prescriptive, and counterfactual AWMs. First, consider the prescriptive model, also called a planner because it imagines the next desirable state that the agent should reach. The top row shows different scenes
in which the ego-vehicle drives by imagining the next desired state and using the environment's inverse kinematics to obtain the action that reaches it. The evaluation shows that
our agent drives just as well when predicting the desired next state as when predicting low-level actions.
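A minimal sketch of this control loop is below, with toy stand-ins for the simulator step, the inverse kinematics, and the learned planner; in Waymax the inverse kinematics come from an invertible dynamics model, and the names here are placeholders, not the actual API:

```python
import jax.numpy as jnp

def sim_step(state, action):
    # Toy forward dynamics: the simulator applies the action.
    return state + action

def inverse_kinematics(state, desired_next_state):
    # Toy inverse dynamics: the action that moves `state` to `desired_next_state`.
    return desired_next_state - state

def planner(state):
    # Stand-in for the learned prescriptive model: drift toward a goal at the origin.
    return 0.9 * state

def drive(state, horizon=10):
    trajectory = [state]
    for _ in range(horizon):
        desired = planner(state)                     # imagine the next desired state
        action = inverse_kinematics(state, desired)  # recover the action that reaches it
        state = sim_step(state, action)              # execute it in the simulator
        trajectory.append(state)
    return jnp.stack(trajectory)

print(drive(jnp.array([5.0, -3.0])))
```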
Second, the bottom row in the above figure shows qualitative predictions from the predictive world model, which is a next-state predictor. This world model allows the agent to imagine the trajectory resulting from a given action sequence.
We can manually set the actions to represent a left or right turn, or straight acceleration. As the agent executes these actions in the simulator, it imagines the next second of its motion at different points in time
(shown as multiple sets of differently-colored points). The imagined trajectory closely aligns with the real, executed one, so the agent can anticipate its future motion accurately.
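The imagination rollout itself is just the next-state predictor applied autoregressively over a fixed action sequence. A minimal sketch, with a toy predictor in place of the learned network and made-up action values:

```python
import jax
import jax.numpy as jnp

def predictive(state, action):
    # Toy next-state predictor standing in for the learned model.
    return state + action

def imagine(initial_state, actions):
    # Roll the predictor forward autoregressively over the action sequence.
    def step(state, action):
        next_state = predictive(state, action)
        return next_state, next_state
    _, imagined_states = jax.lax.scan(step, initial_state, actions)
    return imagined_states

# One second of imagined motion at 10 Hz for a gentle left turn (toy actions).
actions = jnp.tile(jnp.array([1.0, 0.1]), (10, 1))
print(imagine(jnp.zeros(2), actions))
```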
Third, to demonstrate counterfactual capabilities, we train a predictor that, given the current state and an action, estimates an alternative state in which this action would
have been optimal (see the full paper for details). This is still a form of world modeling, and differentiable simulation allows us to find this state efficiently. Once estimated, the
agent can use it as a confidence metric for its action: if the alternative state is far from the current one, the current action is not optimal. Based on this,
in the figure above we color the trajectory according to the agent's confidence in its actions. In this scenario the agent over-accelerates and deviates from the expert trajectory.
As a result, the displacement predicted by the counterfactual model increases and, correspondingly, the agent's confidence decreases.
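A sketch of turning the counterfactual prediction into a confidence signal is below. The displacement between the predicted alternative state and the current state is mapped to a score in (0, 1]; the exponential mapping, the scale tau, and the toy predictor are illustrative choices, not the paper's:

```python
import jax.numpy as jnp

def counterfactual(state, action):
    # Toy stand-in for the learned counterfactual predictor.
    return state + 0.5 * action

def action_confidence(state, action, tau=1.0):
    # Small displacement from the current state means the action is close to optimal.
    displacement = jnp.linalg.norm(counterfactual(state, action) - state)
    return jnp.exp(-displacement / tau)

print(action_confidence(jnp.zeros(2), jnp.array([3.0, 0.0])))
```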
@inproceedings{nachkov2025unlocking,
  title={Unlocking Efficient Vehicle Dynamics Modeling via Analytic World Models},
  author={Nachkov, Asen and Paudel, Danda Pani and Zaech, Jan-Nico and Scaramuzza, Davide and Van Gool, Luc},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}