Westlake Robotics
This paper introduces Humanoid-VLA, the first Vision-Language-Action (VLA) framework designed for humanoid robots, enabling autonomous interaction and context-aware motion generation by integrating language understanding, egocentric vision, and motion control. It addresses data scarcity through a self-supervised data augmentation strategy, outperforms prior text-to-motion models in quality and diversity, and achieves high success rates in real-world object interaction tasks on a Unitree G1 robot.
Researchers from Westlake University and Zhejiang University developed HiF-VLA, a framework that leverages motion representations to integrate hindsight, insight, and foresight into Vision-Language-Action models. This approach effectively mitigates temporal myopia, enabling robots to perform long-horizon manipulation tasks with superior coherence and efficiency, achieving up to a 96.4% success rate on the LIBERO-Long benchmark.
This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations directly derived from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Extensive experiments show that Humanoid-VLA, built upon whole-body control architectures, achieves object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
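To make the self-supervised data augmentation step more concrete, the sketch below illustrates one plausible way to turn raw motion sequences into pseudo-annotated question-answer pairs without human labels. This is a minimal illustration under our own assumptions, not the paper's released code: the data layout, function names (e.g. `summarize_motion`, `make_qa_pairs`), and the specific statistics and question templates are hypothetical.

```python
# Hypothetical sketch of motion-to-QA pseudo-annotation (illustrative only;
# field names, statistics, and templates are assumptions, not the paper's code).
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class MotionClip:
    """A raw motion sequence: (T, J, 3) joint positions plus frame rate."""
    joints: np.ndarray
    fps: float


def summarize_motion(clip: MotionClip) -> dict:
    """Derive simple, label-free statistics from the motion itself."""
    root = clip.joints[:, 0, :]                       # root/pelvis trajectory
    displacement = float(np.linalg.norm(root[-1] - root[0]))
    duration = len(clip.joints) / clip.fps
    mean_speed = displacement / max(duration, 1e-6)
    return {
        "duration_s": duration,
        "displacement_m": displacement,
        "mean_speed_mps": mean_speed,
    }


def make_qa_pairs(clip: MotionClip) -> List[Tuple[str, str]]:
    """Convert motion-derived statistics into textual question-answer pairs
    that can be mixed into language-motion training without human labels."""
    stats = summarize_motion(clip)
    return [
        ("How long does this motion last?",
         f"About {stats['duration_s']:.1f} seconds."),
        ("How far does the body travel?",
         f"Roughly {stats['displacement_m']:.2f} meters."),
        ("Is the motion fast or slow?",
         "Fast." if stats["mean_speed_mps"] > 1.0 else "Slow."),
    ]


if __name__ == "__main__":
    demo = MotionClip(joints=np.random.randn(120, 22, 3) * 0.05, fps=30.0)
    for q, a in make_qa_pairs(demo):
        print(f"Q: {q}\nA: {a}")
```

The point of such a pipeline is that every answer is computed directly from the motion sequence, so large pools of unlabeled motion or video-derived data can be converted into supervision-like QA pairs at negligible annotation cost.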