robotics-perception
Astra, a collaborative effort from Tsinghua University and Kuaishou Technology, introduces an interactive general world model using an autoregressive denoising framework to generate real-world futures with precise action interactions. The model achieves superior performance in instruction following and visual fidelity across diverse simulation scenarios while efficiently extending a pre-trained video diffusion backbone.
The Astribot Team developed Lumo-1, a Vision-Language-Action (VLA) model that explicitly integrates structured reasoning with physical actions to achieve purposeful robotic control on their Astribot S1 bimanual mobile manipulator. This system exhibits superior generalization to novel objects and instructions, improves reasoning-action consistency through reinforcement learning, and outperforms state-of-the-art baselines in complex, long-horizon, and dexterous tasks.
UniUGP presents a unified framework for end-to-end autonomous driving, integrating scene understanding, future video generation, and trajectory planning through a hybrid expert architecture. This approach enhances interpretability with Chain-of-Thought reasoning and demonstrates state-of-the-art performance in challenging long-tail scenarios and multimodal capabilities across various benchmarks.
Researchers from Microsoft Research Asia, Xi'an Jiaotong University, and Fudan University developed VideoVLA, a robot manipulation model that repurposes large pre-trained video generation models. The system jointly predicts future visual states and the corresponding actions, improving generalization to novel objects and skills in both simulated and real-world environments.
Researchers at HKUST developed TrackingWorld, a framework for dense, world-centric 3D tracking of nearly all pixels in monocular videos that effectively disentangles camera and object motion. The method integrates foundation models with a novel optimization pipeline to track objects, including newly emerging ones, and demonstrates superior camera pose estimation and 3D depth consistency, for example an Abs Rel depth error of 0.218 on Sintel versus 0.636 for baselines.
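The core idea of world-centric tracking, separating camera motion from object motion, can be illustrated with a short back-projection sketch. This is not the paper's pipeline; per-pixel depth, intrinsics, and camera poses are assumed to be given (e.g., by the foundation models the summary mentions), and the numbers below are illustrative.

```python
# Minimal sketch: lift a pixel track into world coordinates so camera motion
# and object motion can be separated. Depth, K, and poses are assumed inputs.
import numpy as np

def pixel_to_world(uv, depth, K, T_cam2world):
    """Back-project a pixel (u, v) with metric depth into world coordinates."""
    u, v = uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # camera-frame ray
    p_cam = ray * depth                              # 3D point, camera frame
    p_cam_h = np.append(p_cam, 1.0)                  # homogeneous coordinates
    return (T_cam2world @ p_cam_h)[:3]               # world frame

# A static scene point keeps the same world coordinate across frames, so any
# residual world-frame displacement is attributed to object motion.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
T0 = np.eye(4)
T1 = np.eye(4); T1[:3, 3] = [0.1, 0.0, 0.0]          # camera moved 10 cm in x
p0 = pixel_to_world((300, 220), 2.0, K, T0)
p1 = pixel_to_world((275, 220), 2.0, K, T1)
print("world-frame displacement:", np.linalg.norm(p1 - p0))  # ~0 for static point
```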
SAM-Body4D introduces a training-free framework for 4D human body mesh recovery from videos, synergistically combining promptable video object segmentation and image-based human mesh recovery models with an occlusion-aware mask refinement module. The system produces temporally consistent and robust mesh trajectories, effectively handling occlusions and maintaining identity across frames.
Researchers at ETH Zürich used a reinforcement learning agent to investigate how feedback influences skill acquisition in a complex physical fluid system. Their work demonstrated that learning high-performance skills, particularly those involving non-minimum phase dynamics, can require substantially richer sensory information during training than is necessary for their execution.
Researchers from Peking University, Nanchang University, and Tsinghua University developed the first on-the-fly 3D reconstruction framework for multi-camera rigs, enabling calibration-free, large-scale, and high-fidelity scene reconstruction. The system generates drift-free trajectories and photorealistic novel views, reconstructing 100 meters of road or 100,000 m² of aerial scenes in two minutes.
The H2R-Grounder framework enables the translation of human interaction videos into physically grounded robot manipulation videos without requiring paired human-robot demonstration data. Researchers at the National University of Singapore's Show Lab developed this approach, which utilizes a simple 2D pose representation and fine-tunes a video diffusion model on unpaired robot videos, achieving higher human preference for motion consistency (54.5%), physical plausibility (63.6%), and visual quality (61.4%) compared to baseline methods.
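For intuition, here is a hedged sketch of what a "simple 2D pose representation" used as conditioning could look like: a human hand reduced to a two-fingertip gripper proxy rasterized into an image. The keypoint choice, canvas size, and rendering are assumptions for illustration, not the paper's exact representation.

```python
# Hypothetical 2D gripper-proxy conditioning image built from hand keypoints.
import numpy as np

def render_gripper_proxy(thumb_xy, index_xy, h=128, w=128, radius=3):
    """Rasterize two fingertip keypoints and their midpoint into a mask."""
    canvas = np.zeros((h, w), dtype=np.float32)
    mid = (np.asarray(thumb_xy) + np.asarray(index_xy)) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    for (x, y) in [thumb_xy, index_xy, tuple(mid)]:
        canvas[(xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2] = 1.0
    return canvas

cond = render_gripper_proxy((40, 60), (55, 70))
print(cond.shape, cond.sum())  # fed as a conditioning frame to the video model
```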
ProAgent introduces an end-to-end proactive LLM agent system leveraging on-demand multi-modal sensory contexts from AR glasses to anticipate user needs without explicit commands. It achieved a 33.4% higher proactive accuracy and 16.8% higher F1-score for tool calling compared to existing baselines, while operating efficiently on edge devices.
Researchers from the University of Glasgow developed the Masked Generative Policy (MGP), a framework that employs masked generative transformers for robotic control. This approach integrates parallel action generation with adaptive token refinement, achieving an average 9% higher success rate across 150 tasks while reducing inference time by up to 35x, particularly excelling in long-horizon and non-Markovian environments.
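A minimal sketch of MaskGIT-style parallel decoding with confidence-based refinement, the general mechanism behind masked generative transformers for discrete tokens; the tiny `predict` stub stands in for the trained policy network, and the token vocabulary, chunk length, and schedule are illustrative.

```python
# Parallel masked decoding of an action-token chunk (MaskGIT-style sketch).
import numpy as np

VOCAB, CHUNK, MASK = 64, 16, -1
rng = np.random.default_rng(0)

def predict(tokens):
    """Stand-in for the transformer (ignores its input): per-position
    token predictions and confidences."""
    logits = rng.random((CHUNK, VOCAB))
    probs = logits / logits.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)

def decode(num_steps=4):
    tokens = np.full(CHUNK, MASK)
    for step in range(num_steps):
        ids, conf = predict(tokens)
        still_masked = tokens == MASK
        # commit a growing fraction of the remaining masked slots each step
        n_keep = int(np.ceil(still_masked.sum() * (step + 1) / num_steps))
        order = np.argsort(-conf * still_masked)   # most confident masked first
        commit = order[:n_keep]
        tokens[commit] = ids[commit]
    return tokens

print(decode())  # all positions filled after the final refinement step
```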
Researchers at Physical Intelligence optimized Real-Time Chunking for Vision-Language-Action models by introducing training-time action conditioning, moving computationally intensive prefix conditioning from inference to training. This approach maintained task performance and execution speed, notably reducing latency from 135ms to 108ms and improving robustness at higher inference delays compared to prior methods.
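A hedged sketch of what training-time action conditioning can look like for real-time chunking: training pairs are constructed so the policy sees, as an input, the actions it will still be executing while inference runs, rather than enforcing that prefix at inference time. Chunk size, delay, and array shapes below are illustrative, not Physical Intelligence's values.

```python
# Build (prefix, target chunk) training pairs from a demonstration so the
# new chunk's first DELAY actions coincide with the still-executing prefix.
import numpy as np

CHUNK, ACT_DIM, DELAY = 8, 4, 2   # DELAY = actions executed while inferring

def make_examples(actions):
    """actions: (T, ACT_DIM) demonstration. Returns (prefix, target) pairs."""
    examples = []
    for t in range(0, len(actions) - CHUNK - DELAY, CHUNK - DELAY):
        prev_chunk = actions[t:t + CHUNK]
        prefix = prev_chunk[CHUNK - DELAY:]        # still executing at inference
        target = actions[t + CHUNK - DELAY:t + CHUNK - DELAY + CHUNK]
        examples.append((prefix.copy(), target.copy()))
    return examples

demo = np.cumsum(np.random.randn(100, ACT_DIM) * 0.01, axis=0)
prefix, target = make_examples(demo)[0]
# The first DELAY actions of the target equal the conditioning prefix, so the
# model learns to continue smoothly from actions already committed.
assert np.allclose(target[:DELAY], prefix)
print(prefix.shape, target.shape)
```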
OSMO is an open-source tactile glove platform designed to capture both shear and normal forces from human demonstrations, facilitating direct transfer of these skills to robots. Policies trained using OSMO achieved 71.69% success in a wiping task, outperforming vision-only baselines (55.75%) by eliminating contact-related failures.
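The benefit of sensing shear in addition to normal force can be seen with a simple decomposition; the sketch below splits a 3-axis fingertip reading into normal and tangential components. Sensor layout and frame conventions here are assumptions, not OSMO's specification.

```python
# Split a 3-axis contact force into normal and shear parts.
import numpy as np

def decompose_force(f, surface_normal):
    """Return (normal magnitude, shear vector) of force f w.r.t. the surface."""
    n = surface_normal / np.linalg.norm(surface_normal)
    f_normal = np.dot(f, n)               # component pressing into the surface
    f_shear = f - f_normal * n            # tangential (sliding/friction) part
    return f_normal, f_shear

f = np.array([0.3, -0.1, 1.2])            # Newtons, fingertip frame (example)
n = np.array([0.0, 0.0, 1.0])
fn, fs = decompose_force(f, n)
print(fn, fs, np.linalg.norm(fs))
# A wiping policy can condition on [f_normal, ||f_shear||] per finger, which
# is the kind of contact signal that vision-only baselines miss.
```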
An independent research team secured 1st place in the 2025 BEHAVIOR Challenge, achieving a 26% q-score by enhancing a Vision-Language-Action model (Pi0.5) with innovations like correlated noise for flow matching, "System 2" stage tracking, and practical inference-time heuristics. The approach demonstrated emergent recovery behaviors and addressed challenges in long-horizon, complex manipulation tasks.
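A minimal sketch of one listed ingredient, correlated noise for flow matching over an action chunk: noise is sampled with temporal correlation (here an AR(1) process, which may differ from the team's exact scheme) and plugged into the standard linear flow-matching path.

```python
# Temporally correlated noise for flow-matching training on action chunks.
import numpy as np

def correlated_noise(chunk_len, act_dim, rho=0.9, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((chunk_len, act_dim))
    noise = np.empty_like(eps)
    noise[0] = eps[0]
    for t in range(1, chunk_len):
        # AR(1): correlated across time, unit marginal variance
        noise[t] = rho * noise[t - 1] + np.sqrt(1 - rho**2) * eps[t]
    return noise

def flow_matching_pair(actions, rng=None):
    """Return (x_t, target velocity) for one flow-matching training example."""
    rng = np.random.default_rng() if rng is None else rng
    noise = correlated_noise(*actions.shape, rng=rng)
    t = rng.uniform()
    x_t = (1 - t) * noise + t * actions       # linear path from noise to actions
    v_target = actions - noise                # constant-velocity regression target
    return x_t, v_target

x_t, v = flow_matching_pair(np.zeros((16, 7)))
print(x_t.shape, v.shape)
```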
This work from Monash University introduces Unified Camera Positional Encoding (UCPE), a framework that enables fine-grained control over diverse camera geometries, including 6-DoF poses, intrinsics, and lens distortions, in video generation. UCPE integrates into Diffusion Transformers using a lightweight adapter, leading to superior control accuracy in lens, absolute orientation, and relative pose, while maintaining high visual fidelity.
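For context, a common way to expose per-pixel camera geometry to a Diffusion Transformer is a Plücker ray embedding, sketched below. Whether UCPE uses this exact parameterization, or how it folds in lens distortion, is not stated in the summary and should be treated as an assumption.

```python
# Per-pixel Plücker ray embedding from intrinsics K and a camera-to-world pose.
import numpy as np

def ray_embedding(K, T_cam2world, h, w):
    """Return an (h, w, 6) Plücker embedding: (direction, origin x direction)."""
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (h, w, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                          # camera-frame rays
    R, t = T_cam2world[:3, :3], T_cam2world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = np.broadcast_to(t, dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)               # (h, w, 6)

K = np.array([[128.0, 0, 64], [0, 128.0, 64], [0, 0, 1]])
emb = ray_embedding(K, np.eye(4), 128, 128)
print(emb.shape)   # passed through a lightweight adapter into the DiT
```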
Researchers developed TacThru, a novel See-Through-Skin sensor enabling truly simultaneous tactile and visual perception, which is integrated into the TacThru-UMI imitation learning framework. This combination achieves an 85.5% average success rate in complex robotic manipulation tasks, surpassing vision-only (55.4%) and traditional tactile-visual baselines (66.3%).
AutoSeg3D introduces a tracking-centric framework for online, real-time 3D instance segmentation, leveraging Long-Term Memory, Short-Term Memory, and Spatial Consistency Learning. This method achieves 45.5 AP on ScanNet200, a 2.8 AP improvement over the previous state-of-the-art ESAM, while operating at real-time speeds.
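A hedged sketch of tracking-centric association against short- and long-term memories: current-frame instance embeddings are greedily matched to memory tracks by similarity. The threshold, mixing weight, and greedy matcher are illustrative stand-ins, not AutoSeg3D's actual components.

```python
# Greedy association of per-frame instance embeddings to memory tracks.
import numpy as np

def cosine(a, b):
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True)
                      * np.linalg.norm(b, axis=1) + 1e-8)

def associate(cur, short_mem, long_mem, thresh=0.6, alpha=0.7):
    """Match current instance embeddings to memory tracks; unmatched start new tracks."""
    sims = alpha * cosine(cur, short_mem) + (1 - alpha) * cosine(cur, long_mem)
    matches, used = {}, set()
    for i in np.argsort(-sims.max(axis=1)):          # most confident first
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= thresh and j not in used:
            matches[int(i)] = j
            used.add(j)
    return matches

cur = np.random.randn(5, 64)
short_mem = np.vstack([cur[2] + 0.05 * np.random.randn(64), np.random.randn(64)])
long_mem = short_mem.copy()
print(associate(cur, short_mem, long_mem))           # instance 2 -> track 0
```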
Single-channel 3D reconstruction is widely used in fields such as robotics and medical imaging. While this line of work excels at reconstructing 3D geometry, the outputs are not colored 3D models, thus 3D colorization is required for visualization. Recent 3D colorization studies address this problem by distilling 2D image colorization models. However, these approaches suffer from an inherent inconsistency of 2D image models. This results in colors being averaged during training, leading to monotonous and oversimplified results, particularly in complex 360° scenes. In contrast, we aim to preserve color diversity by generating a new set of consistently colorized training views, thereby bypassing the averaging process. Nevertheless, eliminating the averaging process introduces a new challenge: ensuring strict multi-view consistency across these colorized views. To achieve this, we propose LoGoColor, a pipeline designed to preserve color diversity by eliminating this guidance-averaging process with a 'Local-Global' approach: we partition the scene into subscenes and explicitly tackle both inter-subscene and intra-subscene consistency using a fine-tuned multi-view diffusion model. We demonstrate that our method achieves quantitatively and qualitatively more consistent and plausible 3D colorization on complex 360° scenes than existing methods, and validate its superior color diversity using a novel Color Diversity Index.
Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
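A minimal sketch of the hierarchical control loop described above: a VPR step picks the nearest reference node, and a refined relative pose is turned into a velocity command. Feature extraction and 3DGS pose refinement are stubbed out; descriptor sizes, gains, and the controller are assumptions, not the authors' implementation.

```python
# Coarse VPR localization plus a proportional controller toward the trajectory.
import numpy as np

def vpr_localize(query_feat, ref_feats):
    """Coarse localization: index of the most similar reference descriptor."""
    sims = ref_feats @ query_feat / (
        np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    return int(np.argmax(sims))

def control_from_pose(pose_err_xy, yaw_err, k_v=0.5, k_w=1.0):
    """Simple proportional controller toward the demonstrated trajectory."""
    v = k_v * np.linalg.norm(pose_err_xy)
    w = k_w * yaw_err
    return np.clip(v, 0, 1.0), np.clip(w, -1.0, 1.0)

ref_feats = np.random.randn(200, 256)          # one descriptor per graph node
query = ref_feats[42] + 0.05 * np.random.randn(256)
node = vpr_localize(query, ref_feats)          # coarse node on the trajectory
# In the full system the node's local 3DGS model would refine the relative
# pose; here a refined (dx, dy, dyaw) error is simply assumed available.
v, w = control_from_pose(np.array([0.4, 0.1]), 0.2)
print(node, v, w)
```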
Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.