OpenVLA: An Open-Source Vision-Language-Action Model

BibTeX
@article{Kim2024OpenVLAAO,
  author  = {Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and A. Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag R. Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
  title   = {OpenVLA: An Open-Source Vision-Language-Action Model},
  journal = {arXiv preprint arXiv:2406.09246},
  year    = {2024}
}
GitHub: https://github.com/openvla/openvla
AI Audio Lecture + Q&A
Transcript
John: Welcome to our seminar on Generalist AI for Robotics. Today's lecture is on 'OpenVLA: An Open-Source Vision-Language-Action Model.' We've seen a lot of work in this space, with large, closed models from places like Google DeepMind setting performance benchmarks, alongside a push for open-source alternatives like 'Octo'. This paper, a collaboration between Stanford, Berkeley, and the Toyota Research Institute, directly enters that conversation. It challenges the idea that state-of-the-art performance requires proprietary, massive-scale models by introducing a powerful yet accessible alternative. Yes, Noah?

Noah: Excuse me, Professor. You mentioned 'Octo.' How does OpenVLA differ? Isn't Octo also an open-source generalist policy?

John: An excellent clarifying question. While both are open-source generalist policies, their core architectures differ. Octo is a transformer-based model designed from the ground up for robotics. OpenVLA, as its name suggests, is a Vision-Language-Action, or VLA, model. It specifically fine-tunes a large, pre-existing Vision-Language Model, in this case one based on Llama 2, to output robot actions. The core hypothesis is that you can leverage the vast knowledge already encoded in these web-scale foundation models for robotic control.

Noah: So the contribution is less about a novel architecture and more about adapting an existing VLM for robotics in an open-source way?

John: Precisely. The main objectives were threefold. First, to develop and release a high-performing, fully open-source VLA to counter the closed nature of models like Google's RT-2-X. This is about democratizing access. Second, to demonstrate that this open model could achieve state-of-the-art performance, even outperforming larger models. And third, a crucial and often overlooked aspect, to explore practical and efficient methods for fine-tuning these models for new robots and tasks, making them genuinely useful for researchers without massive compute clusters.

Noah: Okay, that makes sense. So how does it actually work? What's under the hood?

John: The architecture is built on a VLM called Prismatic-7B. It has three main parts. First, a visual encoder processes the images. This is a key design choice: they fuse features from two different pretrained vision models, DINOv2 and SigLIP. The idea is that DINOv2 provides fine-grained spatial information, while SigLIP provides higher-level semantic understanding. These fused visual features are then projected into the language model's embedding space. The backbone is a Llama 2 7B model. Finally, to generate actions, they discretize the continuous robot end-effector actions into tokens and train the model using a standard next-token prediction objective, just like a language model.

Noah: Wait, they just turn robot movements into text tokens? And how do they add these new 'action tokens' to the Llama vocabulary?

John: Yes, it's a clever way to reframe the problem. Each dimension of the robot's action is binned into one of 256 values. As for the vocabulary, they use a pragmatic solution: they find the 256 least-used tokens in Llama's original vocabulary and simply overwrite them with these new action tokens. The model is then fine-tuned on the massive Open X-Embodiment dataset, but the loss is only calculated on the predicted action tokens. This focuses the model's learning on the control task without catastrophically forgetting its language and vision capabilities.
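To make the action-tokenization scheme described above concrete, here is a minimal Python sketch, not the official OpenVLA code. It assumes each action dimension is pre-normalized to [-1, 1] (the paper bins each dimension between low and high quantiles of the training data; a fixed range stands in for that here), assumes a 32,000-token Llama 2 vocabulary, and repurposes the last 256 token IDs as action tokens, standing in for the "256 least-used tokens" mentioned in the lecture.

```python
import numpy as np

# Illustrative sketch of discretized action tokens (not the official OpenVLA
# implementation). Assumes actions are pre-normalized to [-1, 1] and that the
# last 256 IDs of a 32,000-token Llama 2 vocabulary are repurposed as action tokens.

N_BINS = 256                              # discrete bins per action dimension
VOCAB_SIZE = 32_000                       # Llama 2 vocabulary size (assumed here)
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # first repurposed token ID

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action (e.g. 7-DoF end-effector delta + gripper) to token IDs."""
    bins = np.round((action + 1.0) / 2.0 * (N_BINS - 1)).astype(int)
    bins = np.clip(bins, 0, N_BINS - 1)
    return ACTION_TOKEN_START + bins

def tokens_to_action(token_ids: np.ndarray) -> np.ndarray:
    """Invert the mapping: token IDs back to approximate continuous values."""
    bins = token_ids - ACTION_TOKEN_START
    return bins / (N_BINS - 1) * 2.0 - 1.0

# Round-trip example with a 7-dimensional action.
a = np.array([0.10, -0.30, 0.00, 0.55, -1.00, 1.00, 0.80])
tokens = action_to_tokens(a)
print(tokens)                    # token IDs in the range [31744, 31999]
print(tokens_to_action(tokens))  # close to the original action, up to binning error
```

During training, these token IDs are appended to the language prompt and supervised with the usual cross-entropy loss, which is why the loss can be restricted to the action tokens alone.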
Noah: So did it work? How did it compare to closed models like RT-2-X?

John: The results were quite significant. On a head-to-head comparison across 29 tasks, the 7-billion-parameter OpenVLA outperformed the 55-billion-parameter RT-2-X by a substantial margin. This finding suggests that architectural choices, like the fused vision encoder, and careful data curation might be more impactful than simply scaling up parameter count. Furthermore, the paper demonstrates highly effective adaptation. Using a technique called Low-Rank Adaptation, or LoRA, they could fine-tune OpenVLA for a new robot on a single consumer-grade GPU in about 10 hours, matching the performance of full fine-tuning, which required eight times the compute.

Noah: That's a huge deal for smaller labs. So the implication is that we can now get state-of-the-art performance and adapt it to our own robots without needing a datacenter?

John: That is the central implication, yes. This work significantly lowers the barrier to entry for advanced robotics research. By open-sourcing not just the model, but the entire training and fine-tuning pipeline, it establishes a foundational platform for the community. It shifts the focus from a race for sheer scale to a more collaborative effort centered on accessible, efficient, and adaptable models. It also provides concrete evidence that leveraging pretrained web-scale models, when done carefully, is an extremely powerful approach for robotics, especially for generalization to novel instructions and objects.

Noah: So it seems like the future is less about building robot-specific models from scratch, and more about effectively adapting these massive foundation models.

John: That appears to be the direction the field is moving, and OpenVLA makes a strong case for it. In summary, the paper introduces a powerful, open-source VLA that not only sets a new state-of-the-art benchmark but also provides a practical roadmap for real-world adaptation. The key takeaway is the democratization of high-performance robotics AI, enabling broader community participation. It proves that progress isn't solely dependent on building bigger models, but also on building smarter, more accessible ones. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
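As a companion to the LoRA discussion in the transcript, here is a minimal sketch of low-rank adaptation applied to the released checkpoint using the Hugging Face transformers and peft libraries. The rank, alpha, and target-module settings below are illustrative assumptions, not a verified reproduction of the paper's fine-tuning recipe; the official repository provides the full pipeline.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup (assumed hyperparameters, not the paper's exact recipe).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=32,                         # low-rank dimension (small relative to the 7B backbone)
    lora_alpha=16,                # scaling factor (assumption)
    lora_dropout=0.0,
    target_modules="all-linear",  # wrap all linear layers with low-rank adapters
    init_lora_weights="gaussian",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B weights are trainable

# From here, a standard next-token-prediction training loop over the new robot's
# demonstrations (with the loss masked to the action tokens) updates only the
# adapter weights, which is what makes single-GPU adaptation feasible.
```

Because only the adapter matrices receive gradients, optimizer state and gradient memory shrink dramatically compared with full fine-tuning, which is the practical reason this fits on a single GPU.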