X SQUARE ROBOT
X SQUARE ROBOT introduced WALL-OSS, an end-to-end embodied foundation model that systematically bridges the gaps between vision-language models and embodied AI through a tightly coupled architecture and multi-stage training. It demonstrates superior performance in embodied scene understanding, zero-shot instruction following, and complex robotic manipulation tasks.
MIND-V introduces a hierarchical video generation framework for long-horizon robotic manipulation, autonomously synthesizing physically plausible and logically coherent operation videos. It employs a multi-stage architecture with reinforcement learning for physical alignment, providing a scalable method for generating robot training data.
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
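To make the idea of a "single differentiable framework" more concrete, below is a minimal PyTorch-style sketch of a policy that produces both reasoning/subgoal token logits and a continuous action chunk from shared multimodal features, trained with one joint loss. All module names, dimensions, and loss choices are illustrative assumptions for exposition only, not the actual WALL-OSS architecture.

# Illustrative sketch only: one differentiable model that (a) emits
# subgoal/reasoning token logits and (b) regresses a continuous action chunk
# from the same shared multimodal features. Names and shapes are assumptions,
# not the WALL-OSS implementation.
import torch
import torch.nn as nn

class UnifiedEmbodiedPolicy(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, action_dim=7, chunk_len=16):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(768, d_model)  # projects precomputed image patch features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)                   # reasoning / subgoal tokens
        self.action_head = nn.Linear(d_model, action_dim * chunk_len)   # fine-grained action chunk
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, image_feats, text_ids):
        # image_feats: (B, N_patches, 768); text_ids: (B, T)
        vis = self.vision_proj(image_feats)
        txt = self.token_embed(text_ids)
        h = self.backbone(torch.cat([vis, txt], dim=1))      # shared multimodal representation
        text_logits = self.lm_head(h[:, vis.size(1):])       # logits over the text span
        actions = self.action_head(h[:, -1])                 # decode chunk from the last position
        return text_logits, actions.view(-1, self.chunk_len, self.action_dim)

# Joint objective: language loss on subgoal tokens plus regression loss on actions,
# so reasoning and control are optimized in a single backward pass.
model = UnifiedEmbodiedPolicy()
img = torch.randn(2, 49, 768)
ids = torch.randint(0, 32000, (2, 12))
target_ids = torch.randint(0, 32000, (2, 12))
target_actions = torch.randn(2, 16, 7)
logits, pred_actions = model(img, ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 32000), target_ids.reshape(-1)) \
     + nn.functional.mse_loss(pred_actions, target_actions)
loss.backward()

The point of the sketch is that gradients from both the language objective and the action objective flow through the same backbone, which is what "unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis" implies at the level of training mechanics.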