alphaXiv

Galaxea AI

15 Oct 2025

computer-science computer-vision-and-pattern-recognition imitation-learning

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Tsinghua University; Galaxea AI; IIIS, Tsinghua University

DepthVLA enhances Vision-Language-Action (VLA) models by integrating a pretrained depth prediction module within a mixture-of-transformers architecture, improving spatial reasoning for robotic manipulation tasks. It achieves a 74.8% success rate on the simulated WidowX benchmark and a 79% average progress score on a real-world robotic platform.

08 Dec 2025

computer-science computer-vision-and-pattern-recognition robotics

FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Tianjin University; Fudan University; Tsinghua University; University of California, San Diego; Shanghai Innovation Institute; Hong Kong University; Galaxea AI

FASTer is a unified framework that pairs a learnable action tokenizer (FASTerVQ) with an efficient autoregressive policy (FASTerVLA) for robotic manipulation. It reports state-of-the-art performance, with a 97.9% success rate on Libero, and reduces inference latency to 112 ms, enabling real-time control for complex tasks.

03 Jun 2025

computer-science computer-vision-and-pattern-recognition robotics

VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion

Shanghai Jiao Tong University; Tsinghua University; Galaxea AI; Shanghai Qi-Zhi Institute

Recent success in legged robot locomotion is attributed to the integration of reinforcement learning and physical simulators. However, these policies often encounter challenges when deployed in real-world environments due to sim-to-real gaps, as simulators typically fail to replicate visual realism and complex real-world geometry. Moreover, the lack of realistic visual rendering limits the ability of these policies to support high-level tasks that require RGB-based perception, such as ego-centric navigation. This paper presents a Real-to-Sim-to-Real framework that generates photorealistic and physically interactive "digital twin" simulation environments for visual navigation and locomotion learning. Our approach uses 3D Gaussian Splatting (3DGS) to reconstruct scenes from multi-view images and integrates these environments into simulations that support ego-centric visual perception and mesh-based physical interactions. To demonstrate its effectiveness, we train a reinforcement learning policy within the simulator to perform a visual goal-tracking task. Extensive experiments show that our framework achieves RGB-only sim-to-real policy transfer. Additionally, our framework facilitates the rapid adaptation of robot policies with effective exploration capability in complex new environments, highlighting its potential for applications in households and factories.
