Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability of existing models has not been fully explored. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions for solving visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model as a cold start. In the RL stage, we use outcome accuracy and format rewards to update the Qwen2.5-VL model, further refining its search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
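As a rough illustration of the RL stage's reward, the sketch below combines an outcome-accuracy term with a format term. The tag names and the weighting are assumptions made for illustration, not the paper's exact recipe.

```python
import re

def reward(response: str, ground_truth: str) -> float:
    """Toy reward mixing outcome accuracy and format compliance.
    Tag conventions and weights are illustrative assumptions."""
    # Format reward: the response should expose reasoning and the final
    # answer in dedicated tags, e.g. <think>...</think><answer>...</answer>.
    well_formed = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                                 response, flags=re.S))
    format_reward = 1.0 if well_formed else 0.0

    # Accuracy reward: exact match between the extracted answer and the label.
    m = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    answer = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0

    return accuracy_reward + 0.5 * format_reward
```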
PiSSA (Principal Singular Values and Singular Vectors Adaptation) introduces an SVD-based initialization strategy for low-rank adapters in Large Language Models, directly tuning the principal components of pre-trained weight matrices. This approach consistently outperforms LoRA in fine-tuning performance across diverse models and tasks, achieving faster convergence and significantly reducing quantization error in its QPiSSA variant, for example, improving Gemma-7B's GSM8K accuracy by 3.25% over LoRA.
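The core PiSSA initialization can be sketched in a few lines: take the top-r singular triplets of a pre-trained weight as the trainable adapter and freeze the residual. The snippet below is a minimal sketch of that idea, not the released implementation.

```python
import torch

def pissa_init(W: torch.Tensor, r: int):
    """Split a pre-trained weight W (d_out x d_in) into a trainable
    principal low-rank adapter and a frozen residual (sketch only)."""
    # Thin SVD of the pre-trained weight.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Principal components: the top-r singular triplets become the adapter.
    sqrt_S = torch.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                 # (d_out, r), trainable
    B = sqrt_S.unsqueeze(1) * Vh[:r, :]   # (r, d_in), trainable
    # Residual keeps the remaining spectrum and stays frozen.
    W_res = U[:, r:] @ torch.diag(S[r:]) @ Vh[r:, :]
    return A, B, W_res  # forward pass: y = x @ (W_res + A @ B).T
```

Because A @ B initially reproduces the principal part of W exactly, fine-tuning starts from the original model's behavior while updating only the directions that carry most of the weight's energy.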
SCENEWEAVER is a reflective agentic framework that utilizes Multimodal Large Language Models (MLLMs) for feedback-guided, all-in-one 3D scene synthesis. It unifies diverse synthesis methods through a standardized tool interface and a "reason-act-reflect" paradigm, achieving state-of-the-art results in visual realism, physical plausibility, and semantic alignment across various scene types.
Researchers from BIGAI and Tsinghua University introduce MANIPTRANS, a two-stage framework that enables efficient transfer of human bimanual manipulation skills to robotic hands through combined trajectory imitation and residual learning, while generating DEXMANIPNET - a dataset of 3.3K manipulation episodes demonstrating generalization across multiple robotic hand designs.
Researchers developed COLA, a framework that enables humanoid robots to carry objects compliantly and in coordination with humans, using a proprioception-only policy learned through a three-step training process. The system reduced human effort by 24.7% and demonstrated stable performance across diverse objects and terrains in real-world scenarios.
UltraEdit introduces a large-scale (over 4 million samples), high-quality dataset for instruction-based image editing, designed to mitigate common biases and enhance fine-grained control. Models trained on UltraEdit demonstrate improved performance on established benchmarks like MagicBrush and Emu Edit, particularly for region-based and multi-turn editing tasks.
RoboVerse introduces a unified robotics platform combining high-fidelity simulation environments, a large-scale synthetic dataset, and standardized benchmarks for imitation and reinforcement learning, enabling cross-simulator integration and improved sim-to-real transfer through its METASIM infrastructure and diverse data generation approaches.
Researchers developed an automated pipeline to generate a large-scale multi-modal tool-usage dataset (MM-Traj), which was then used to fine-tune Vision-Language Models (VLMs) as agents, resulting in enhanced multi-step reasoning and tool-usage capabilities across various multi-modal tasks. The T3-Agent, trained on MM-Traj, demonstrated improved accuracy and tool utilization on challenging multi-modal benchmarks like GTA and GAIA, closing the performance gap with larger, closed-source models.
Building Graphical User Interface (GUI) agents, which simulate human interaction with computers or mobile phones to perform diverse GUI tasks, is a promising research direction. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotation. In this paper, we propose the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, producing the GUI-Net dataset of 143K trajectories across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net; the resulting agents show remarkable improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, demonstrating the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.
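For intuition, a GUI-Net-style trajectory record might look like the following; every field name here is a hypothetical illustration rather than the released schema.

```python
# Hypothetical trajectory record distilled from a web tutorial (illustrative only).
trajectory = {
    "source": "web_tutorial_video",
    "os": "Windows",
    "application": "Excel",
    "instruction": "Freeze the top row of the spreadsheet",
    "steps": [
        {"screenshot": "frame_001.png",
         "thought": "Open the View tab to find the freeze options.",
         "action": {"type": "click", "target": "View tab", "bbox": [412, 18, 468, 42]}},
        {"screenshot": "frame_002.png",
         "thought": "Select 'Freeze Top Row' from the Freeze Panes menu.",
         "action": {"type": "click", "target": "Freeze Top Row", "bbox": [230, 180, 360, 204]}},
    ],
}
```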
Researchers developed SPORT, an iterative framework enabling multimodal agents to autonomously enhance tool usage without human-annotated data. This method employs step-wise AI feedback and direct preference optimization, achieving up to 6.41% higher answer accuracy on the GTA benchmark and 3.64% higher answer accuracy on the GAIA benchmark compared to strong baselines.
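SPORT's preference-tuning step builds on direct preference optimization; the sketch below shows the standard DPO objective applied to (chosen, rejected) step pairs, as a generic illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1):
    """Standard DPO objective on preference pairs. Inputs are summed log-probs
    of each candidate step under the policy and a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two step pairs,
# where "chosen" vs. "rejected" would come from step-wise AI feedback.
loss = dpo_loss(torch.tensor([-4.2, -3.1]), torch.tensor([-5.0, -2.9]),
                torch.tensor([-4.5, -3.3]), torch.tensor([-4.8, -3.0]))
```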
PoliCon introduces a benchmark to assess large language models' capabilities in facilitating and achieving political consensus under diverse real-world objectives. Experiments revealed that while models like Gemini-2.5-Flash perform well on straightforward consensus tasks, all tested LLMs struggle with complex goals such as Rawlsianism and exhibit inherent partisan biases, highlighting limitations in sophisticated coalition-building.
CLONE introduces a closed-loop system for whole-body humanoid teleoperation that achieves precise control for long-horizon tasks, virtually eliminating positional drift using only head and hand tracking. The system demonstrated a mean tracking error of 5.1 cm over 8.9 meters on a Unitree G1 robot, effectively enabling robust and coordinated execution of diverse dynamic skills.
JARVIS-VLA, a Vision Language Action model, is introduced for playing visual games in Minecraft using keyboard and mouse, leveraging ActVLP, a three-stage post-training paradigm that improves world knowledge and visual grounding before action tuning. This enabled JARVIS-VLA-Qwen2-VL to achieve state-of-the-art performance on over 1,000 diverse atomic tasks in Minecraft, a 40% improvement over prior baselines.
Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using an LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at this https URL.
Embodied VideoAgent develops a multimodal agent that uses persistent object memory from egocentric videos and embodied sensors to understand dynamic 3D scenes. The system achieved an 85.37% success rate for 3D object localization on Ego4D-VQ3D and showed improved performance on embodied question answering benchmarks compared to existing models.
Text-guided image editing involves modifying a source image based on a language instruction and typically requires changes only to small local regions. However, existing approaches generate the entire target image rather than selectively regenerating only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modifications to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, achieving a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: this https URL
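A minimal sketch of the NEP idea, assuming a hypothetical any-order autoregressive model interface: only the tokens inside the edit mask are resampled, while all other image tokens are kept verbatim.

```python
import torch

def edit_by_nep(tokens: torch.Tensor, edit_mask: torch.Tensor, model):
    """Regenerate only masked positions of a flattened image-token grid.
    `model(tokens, pos)` is a hypothetical any-order AR model returning
    logits for one position, conditioned on the current token grid."""
    tokens = tokens.clone()
    for pos in edit_mask.nonzero(as_tuple=False).flatten().tolist():
        logits = model(tokens, pos)      # condition on all currently visible tokens
        tokens[pos] = torch.distributions.Categorical(logits=logits).sample()
    return tokens

# Toy usage: a random "model" that ignores context, just to exercise the loop.
vocab = 1024
dummy = lambda toks, pos: torch.randn(vocab)
grid = torch.randint(0, vocab, (256,))            # flattened 16x16 token grid
mask = torch.zeros(256, dtype=torch.bool)
mask[40:60] = True                                # the region to be edited
edited = edit_by_nep(grid, mask, dummy)
```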
DreamArt generates high-fidelity, interactable articulated 3D objects from a single image by integrating part-aware 3D reconstruction, conditional video diffusion with novel prompts, and physics-informed articulation optimization. The system demonstrated superior visual quality and articulation plausibility compared to existing methods, achieving higher PSNR, SSIM, and FVD for video synthesis, and significantly better user study scores for overall asset quality.
We introduce UnrealZoo, a collection of over 100 photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open-world environments. We also provide a rich variety of playable entities, including humans, animals, robots, and vehicles for embodied AI research. We extend UnrealCV with optimized APIs and tools for data collection, environment augmentation, distributed training, and benchmarking. These optimizations significantly improve rendering and communication efficiency, enabling advanced applications such as multi-agent interaction. Our experimental evaluation across visual navigation and tracking tasks reveals two key insights: 1) environmental diversity provides substantial benefits for developing generalizable reinforcement learning (RL) agents, and 2) current embodied agents face persistent challenges in open-world scenarios, including navigation in unstructured terrain, adaptation to unseen morphologies, and managing latency in closed-loop control when interacting with highly dynamic objects. UnrealZoo thus serves as both a comprehensive testing ground and a pathway toward developing more capable embodied AI systems for real-world deployment.
A systematic review comprehensively organizes the field of safe reinforcement learning, analyzing its methods, theories, and applications through a proposed '2H3W' framework. It provides a detailed examination of sample complexity, benchmarks, and emerging challenges in multi-agent and human-compatible safe RL.
Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms -- e.g., the Hadamard transform or PCA -- before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.
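To make the sharding pattern concrete, the toy NumPy simulation below splits the latent dimension across two "devices", runs attention independently per shard, and sums the partial outputs in place of the all-reduce; sizes and weights are made up for illustration and this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_model, n_ctx, n_shards = 64, 32, 16, 2   # toy sizes
shard = d_latent // n_shards

# Compressed per-token latent KV cache and the current decoding query,
# both expressed in the latent space (as in MLA).
kv_latent = rng.standard_normal((n_ctx, d_latent))
q_latent = rng.standard_normal(d_latent)
W_o = rng.standard_normal((d_latent, d_model))        # output projection

def shard_attention(q, kv):
    """Attention run independently on one latent shard (its own softmax)."""
    scores = kv @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ kv                                      # per-shard context vector

# Each "device" holds only its slice of the query, the latent cache, and the
# output projection; summing the partial outputs stands in for the all-reduce.
partials = []
for i in range(n_shards):
    sl = slice(i * shard, (i + 1) * shard)
    ctx = shard_attention(q_latent[sl], kv_latent[:, sl])
    partials.append(ctx @ W_o[sl])                     # (d_model,)
out = np.sum(partials, axis=0)
# Note: each shard normalizes its own softmax; this cross-shard interference
# is what the orthogonal transforms applied before slicing are meant to mitigate.
```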