JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

BibTeX

@misc{li2025jarvisvlaposttraininglargescale,
      title={JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
      author={Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
      year={2025},
      eprint={2503.16365},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.16365},
}

GitHub repository: JarvisVLA

HTTPS: https://github.com/CraftJarvis/JarvisVLA
SSH: git@github.com:CraftJarvis/JarvisVLA.git
CLI: gh repo clone CraftJarvis/JarvisVLA