PiSSA (Principal Singular Values and Singular Vectors Adaptation) introduces an SVD-based initialization strategy for low-rank adapters in Large Language Models, directly tuning the principal components of pre-trained weight matrices. This approach consistently outperforms LoRA across diverse models and tasks, converging faster and, in its QPiSSA variant, significantly reducing quantization error; for example, it improves Gemma-7B's GSM8K accuracy by 3.25% over LoRA.
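The initialization itself is compact enough to sketch: the pre-trained weight is factored by SVD, the top-r singular triplets seed the trainable adapter pair, and the remainder is frozen. A minimal PyTorch illustration of the idea (function and variable names are ours, not the paper's code):

```python
import torch
from torch import nn

def pissa_init(W: torch.Tensor, r: int):
    """Split a pre-trained weight W (out_dim x in_dim) into a trainable
    principal low-rank pair (A, B) and a frozen residual: the top-r
    singular directions of W become the adapter's starting point."""
    U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)
    sqrt_s = torch.sqrt(S[:r])
    A = nn.Parameter(U[:, :r] * sqrt_s)          # (out_dim, r), trainable
    B = nn.Parameter(sqrt_s[:, None] * Vh[:r])   # (r, in_dim), trainable
    W_res = (W.detach() - A @ B).detach()        # frozen residual
    return A, B, W_res

# Effective layer weight during fine-tuning: W_res + A @ B
```

Because A @ B reproduces the principal component of W at step 0, training starts from the pre-trained weights exactly, unlike LoRA's zero-initialized update.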
SCENEWEAVER is a reflective agentic framework that utilizes Multimodal Large Language Models (MLLMs) for feedback-guided, all-in-one 3D scene synthesis. It unifies diverse synthesis methods through a standardized tool interface and a "reason-act-reflect" paradigm, achieving state-of-the-art results in visual realism, physical plausibility, and semantic alignment across various scene types.
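The "reason-act-reflect" loop is a general agentic pattern; a schematic sketch of how such a controller might drive a unified tool interface is shown below (all method and attribute names, such as plan, critique, and render, are illustrative assumptions, not SCENEWEAVER's actual API):

```python
# Schematic reason-act-reflect loop; interfaces are illustrative assumptions.
def synthesize_scene(mllm, tools, render, instruction, max_iters=5):
    scene = None
    for _ in range(max_iters):
        # Reason: the MLLM selects a synthesis tool and its arguments
        # from the standardized tool interface.
        plan = mllm.plan(instruction, scene, list(tools))
        # Act: invoke the chosen tool to create or refine the scene.
        scene = tools[plan.tool](scene, **plan.args)
        # Reflect: the MLLM critiques a rendering of the current scene
        # and either accepts it or issues a revised instruction.
        feedback = mllm.critique(render(scene), instruction)
        if feedback.acceptable:
            break
        instruction = feedback.revised_instruction
    return scene
```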
Researchers from BIGAI and Tsinghua University introduce MANIPTRANS, a two-stage framework that efficiently transfers human bimanual manipulation skills to robotic hands by combining trajectory imitation with residual learning, and release DEXMANIPNET, a dataset of 3.3K manipulation episodes demonstrating generalization across multiple robotic hand designs.
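The residual-learning stage follows a widely used pattern: a frozen base policy obtained from trajectory imitation proposes an action, and a small learned network outputs a correction. A minimal PyTorch sketch under assumed interfaces (the class and layer sizes are ours, not the paper's):

```python
import torch
from torch import nn

class ResidualPolicy(nn.Module):
    """Schematic residual correction on top of a frozen imitation policy."""
    def __init__(self, base_policy: nn.Module, obs_dim: int, act_dim: int):
        super().__init__()
        self.base = base_policy.eval()        # stage 1: trajectory imitation, frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.residual = nn.Sequential(        # stage 2: trained to correct the base
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            base_action = self.base(obs)      # imitated reference action
        delta = self.residual(torch.cat([obs, base_action], dim=-1))
        return base_action + delta            # corrected action
```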
Researchers developed COLA, a framework enabling humanoid robots to carry objects with humans in a compliant, coordinated manner, leveraging a proprioception-only policy learned through a three-step training process. The system achieved a 24.7% reduction in human effort and demonstrated stable performance across diverse objects and terrains in real-world scenarios.
UltraEdit introduces a large-scale (over 4 million samples), high-quality dataset for instruction-based image editing, designed to mitigate common biases and enhance fine-grained control. Models trained on UltraEdit demonstrate improved performance on established benchmarks like MagicBrush and Emu Edit, particularly for region-based and multi-turn editing tasks.
RoboVerse introduces a unified robotics platform combining high-fidelity simulation environments, a large-scale synthetic dataset, and standardized benchmarks for imitation and reinforcement learning, enabling cross-simulator integration and improved sim-to-real transfer through its METASIM infrastructure and diverse data generation approaches.
Researchers developed an automated pipeline to generate a large-scale multi-modal tool-usage dataset (MM-Traj), which was then used to fine-tune Vision-Language Models (VLMs) as agents, resulting in enhanced multi-step reasoning and tool-usage capabilities across various multi-modal tasks. The T3-Agent, trained on MM-Traj, demonstrated improved accuracy and tool utilization on challenging multi-modal benchmarks like GTA and GAIA, closing the performance gap with larger, closed-source models.
Researchers developed SPORT, an iterative framework enabling multimodal agents to autonomously improve their tool usage without human-annotated data. The method combines step-wise AI feedback with direct preference optimization, achieving up to 6.41% higher answer accuracy on the GTA benchmark and 3.64% higher on GAIA compared to strong baselines.
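SPORT's contribution is constructing step-wise preference pairs via AI feedback; the optimization step itself uses the standard DPO objective, which can be sketched as follows (the function below is the generic DPO loss, with argument names of our choosing):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO objective on a preference pair: push the policy's
    log-prob margin for the preferred step above the frozen reference
    model's margin. Inputs are summed token log-probabilities."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```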
PoliCon introduces a benchmark to assess large language models' capabilities in facilitating and achieving political consensus under diverse real-world objectives. Experiments revealed that while models like Gemini-2.5-Flash perform well on straightforward consensus tasks, all tested LLMs struggle with complex goals such as Rawlsianism and exhibit inherent partisan biases, highlighting limitations in sophisticated coalition-building.
CLONE introduces a closed-loop system for whole-body humanoid teleoperation that achieves precise control for long-horizon tasks, virtually eliminating positional drift using only head and hand tracking. The system demonstrated a mean tracking error of 5.1 cm over 8.9 meters on a Unitree G1 robot, effectively enabling robust and coordinated execution of diverse dynamic skills.
JARVIS-VLA, a Vision Language Action model, is introduced for playing visual games in Minecraft with keyboard and mouse, leveraging ActVLP, a three-stage post-training paradigm that improves world knowledge and visual grounding before action tuning. This enabled JARVIS-VLA-Qwen2-VL to achieve state-of-the-art performance on over 1,000 diverse atomic tasks in Minecraft, demonstrating a 40% improvement over prior baselines.
The paper "Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model" provides a systematic review and unified benchmark for tuning MLLMs, classifying methods into Selective, Additive, and Reparameterization paradigms. It empirically analyzes the trade-offs between task-expert specialization and open-world stabilization, offering practical guidelines for MLLM deployment.
Embodied VideoAgent develops a multimodal agent that uses persistent object memory from egocentric videos and embodied sensors to understand dynamic 3D scenes. The system achieved an 85.37% success rate for 3D object localization on Ego4D-VQ3D and showed improved performance on embodied question answering benchmarks compared to existing models.
DreamArt generates high-fidelity, interactable articulated 3D objects from a single image by integrating part-aware 3D reconstruction, conditional video diffusion with novel prompts, and physics-informed articulation optimization. The system demonstrated superior visual quality and articulation plausibility compared to existing methods, achieving higher PSNR, SSIM, and FVD for video synthesis, and significantly better user study scores for overall asset quality.