Closing the Train-Test Gap in World Models for Gradient-Based Planning

BibTeX
@misc{parthasarathy2025closingtraintestgap,
      title={Closing the Train-Test Gap in World Models for Gradient-Based Planning},
      author={Arjun Parthasarathy and Nimit Kalra and Rohun Agrawal and Yann LeCun and Oumayma Bounou and Pavel Izmailov and Micah Goldblum},
      year={2025},
      eprint={2512.09929},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.09929},
}
GitHub: robust-world-model-planning
HTTPS: https://github.com/qw3rtman/robust-world-model-planning
SSH: git@github.com:qw3rtman/robust-world-model-planning.git
CLI: gh repo clone qw3rtman/robust-world-model-planning
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Model-Based Control. Today's lecture is on the paper 'Closing the Train-Test Gap in World Models for Gradient-Based Planning' from a team at Columbia and NYU. We've seen a lot of recent work, like 'Planning with Latent Dynamics Models,' that leverages large, pretrained visual encoders for control. This paper operates in that same space but focuses on a very specific and persistent problem: why gradient-based planning with these models often underperforms in practice, despite its theoretical efficiency. Yes, Noah?

Noah: Excuse me, Professor. You mentioned gradient-based planning. Isn't that typically less effective than search-based methods like CEM in these world models?

John: That's an excellent point, and it's precisely the problem this paper tackles. The authors identify a fundamental 'train-test gap.' The world model is trained offline on a fixed dataset of expert trajectories, but during planning, the optimizer explores action sequences that can lead to states the model has never seen before.

Noah: So the planner itself is generating these out-of-distribution states by optimizing actions? It's like it's exploring parts of the state space the world model has never seen and doesn't know how to predict accurately.

John: Exactly. Prediction errors begin to compound over the planning horizon, rendering the model's guidance useless. The second issue they identify is that the optimization landscape itself, induced by the world model, can be rugged and full of poor local minima, which is difficult for gradient descent to navigate. To solve this, they propose two finetuning methods.

Noah: What are the two methods?

John: The first is called Online World Modeling, or OWM. It's an iterative process. You use your current world model to generate a plan. Then, you execute that plan in the true environment simulator to get a corrected, ground-truth trajectory. You add this new, corrected data to your training set and finetune the model. You repeat this process, gradually expanding the model's knowledge to cover the states it actually encounters during planning.

Noah: But doesn't OWM require constant access to a live simulator or environment? That seems computationally expensive and not always feasible.

John: It does, which is a significant practical consideration. And that leads directly to their second, more self-contained approach: Adversarial World Modeling, or AWM. This method doesn't require a simulator for finetuning. Instead, it improves the model's robustness by training it on adversarially perturbed inputs. It makes the model more resilient to small changes in states and actions.

John: Let's look at the technical approach. The foundation is a latent world model. They use a fixed, pretrained DINOv2 visual encoder to map images to a latent space, and then they train a Vision Transformer to predict the next latent state given the current one and an action. This is a common setup. The novelty is in the finetuning. For Adversarial World Modeling, they generate small perturbations to the input states and actions, specifically perturbations designed to maximally increase the model's prediction error.
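To make the architecture John describes concrete, here is a minimal sketch of such a latent world model in PyTorch. It assumes a frozen DINOv2 ViT-S/14 loaded from torch.hub and a small transformer predictor; the dimensions, depth, and two-token input scheme are illustrative choices, not the paper's exact architecture.

```python
# Minimal latent world model: a frozen DINOv2 encoder plus a small transformer
# that predicts the next latent from the current latent and an action.
# Dimensions and depth are illustrative assumptions.
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    def __init__(self, action_dim: int, latent_dim: int = 384, depth: int = 4):
        super().__init__()
        # Frozen, pretrained visual encoder (DINOv2 ViT-S/14; 384-d output).
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # The action is embedded into the same latent space as the state.
        self.action_embed = nn.Linear(action_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=6, dim_feedforward=4 * latent_dim,
            batch_first=True,
        )
        self.predictor = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(latent_dim, latent_dim)

    @torch.no_grad()
    def encode(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, 224, 224) -> latent: (B, 384)
        return self.encoder(image)

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Treat (z_t, a_t) as a two-token sequence and predict z_{t+1}.
        tokens = torch.stack([z, self.action_embed(action)], dim=1)
        out = self.predictor(tokens)
        return self.head(out[:, 0])
```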
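The Online World Modeling loop John outlines can be sketched at a high level as follows. The plan_actions, env_rollout, and finetune_on callables are hypothetical placeholders for the planner, the true simulator, and supervised finetuning; they are not functions from the paper's codebase.

```python
# Online World Modeling (OWM), sketched as a plan / execute / aggregate /
# finetune loop over ground-truth transitions.
from typing import Callable, List, Tuple

import torch

Transition = Tuple[torch.Tensor, torch.Tensor, torch.Tensor]  # (z_t, a_t, z_{t+1})


def online_world_modeling(
    model: torch.nn.Module,
    dataset: List[Transition],
    init_state: torch.Tensor,
    plan_actions: Callable[[torch.nn.Module, torch.Tensor], torch.Tensor],
    env_rollout: Callable[[torch.Tensor, torch.Tensor], List[Transition]],
    finetune_on: Callable[[torch.nn.Module, List[Transition]], None],
    rounds: int = 10,
) -> torch.nn.Module:
    for _ in range(rounds):
        # 1. Plan with the current world model; this is where the planner can
        #    drift into states the model was never trained on.
        actions = plan_actions(model, init_state)
        # 2. Execute the plan in the true environment to obtain ground-truth
        #    transitions along the trajectory the planner actually visits.
        corrected = env_rollout(init_state, actions)
        # 3. Aggregate the corrected transitions into the training set.
        dataset.extend(corrected)
        # 4. Finetune so the model's training distribution covers those states.
        finetune_on(model, dataset)
    return model
```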
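Adversarial World Modeling, as John describes it, trains on input perturbations chosen to maximally increase prediction error. A minimal single-step, gradient-sign version of that idea might look like the following; the paper's exact attack, perturbation budget, and number of inner steps may differ.

```python
# One Adversarial World Modeling (AWM) training step, sketched as a single
# gradient-sign perturbation of the latent state and action. The eps values
# are assumptions.
import torch
import torch.nn.functional as F


def awm_step(model, optimizer, z_t, a_t, z_next, eps_z=0.05, eps_a=0.05):
    # Find a small perturbation of (z_t, a_t) that maximally increases the
    # model's prediction error.
    z_adv = z_t.clone().detach().requires_grad_(True)
    a_adv = a_t.clone().detach().requires_grad_(True)
    loss = F.mse_loss(model(z_adv, a_adv), z_next)
    grad_z, grad_a = torch.autograd.grad(loss, (z_adv, a_adv))
    z_adv = (z_t + eps_z * grad_z.sign()).detach()
    a_adv = (a_t + eps_a * grad_a.sign()).detach()

    # Train the model to predict correctly from the perturbed inputs, which
    # smooths the model locally and, with it, the planner's loss landscape.
    optimizer.zero_grad()
    adv_loss = F.mse_loss(model(z_adv, a_adv), z_next)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```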
Noah: So AWM is essentially a form of regularization. By training on these 'worst-case' local perturbations, the model is forced to learn a smoother function, which in turn makes the planning objective easier for a gradient-based optimizer?

John: That's a very precise way to put it. The goal isn't just to defend against attacks in the traditional sense, but to smooth the loss landscape for the planner's gradient descent. It forces the model's output to be less sensitive to small changes in its input. The paper includes a visualization showing that this process creates a wider, more forgiving basin of attraction around the optimal action sequence. And the results are quite convincing. AWM, especially when combined with an Adam optimizer and used in a Model Predictive Control loop, achieves performance that matches or even exceeds the strong, search-based CEM baseline on several tasks.

Noah: Matching CEM is a strong claim. But how much more efficient is it? Isn't that the whole point of using gradient-based planning?

John: The efficiency gain is substantial. The paper reports a tenfold reduction in computation time compared to CEM. This is what moves gradient-based planning from a theoretical curiosity into a practical tool for real-time control, especially as action spaces and planning horizons grow.

John: The implications here are significant. The work makes gradient-based planning a far more viable and attractive alternative to search-based methods. It also provides a practical strategy to mitigate the train-test distribution shift, which is a pervasive issue for any learned dynamics model. The OWM method is conceptually similar to dataset aggregation techniques from imitation learning, like DAgger, but adapted for latent dynamics.

Noah: The use of adversarial training here feels different from its typical application in robustness for image classification. It's not about fooling a classifier, but about improving an optimization process. Is this a common technique in control?

John: It's an emerging and clever application. It connects this work to the broader field of robust machine learning, but repurposes the tools. Instead of defending against an external adversary, they're using an 'internal' adversary to fix the model's own optimization landscape. It highlights a new way to use these techniques not just for robustness, but for improving the stability and performance of optimization in sequential decision-making. They also showed AWM improved a different architecture, IRIS, suggesting the principle is general.

John: So, to wrap up, this research provides a clear demonstration that by directly addressing the train-test gap, gradient-based planning can become both effective and highly efficient. The Adversarial World Modeling method stands out as a practical, simulator-free technique to achieve this by smoothing the planning landscape. The key lesson is that the way a model is trained must align with how it will be used at test time. This work provides a concrete blueprint for closing that gap in the context of planning.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
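To make the planning side of the lecture concrete, here is a rough sketch of gradient-based planning through a frozen world model with Adam, of the kind used inside a Model Predictive Control loop. The quadratic goal cost, horizon, learning rate, and step count are assumptions for illustration, not the paper's exact settings.

```python
# Gradient-based planning through a frozen world model with Adam (sketch).
import torch


def plan_with_adam(model, z0, goal_z, action_dim, horizon=16, steps=100, lr=0.1):
    model.requires_grad_(False)  # only the action sequence is optimized
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z, cost = z0, 0.0
        for t in range(horizon):
            z = model(z, actions[t].unsqueeze(0))      # roll the model forward
            cost = cost + ((z - goal_z) ** 2).mean()   # distance to the goal latent
        cost.backward()                                # gradients flow through the model
        opt.step()
    return actions.detach()


# In a Model Predictive Control loop, only the first planned action is executed,
# the true next state is observed and re-encoded, and planning repeats from there.
```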