Beijing Institute for General Artificial Intelligence (BIGAI)
Researchers at BIGAI and Unitree Robotics developed a unified policy for legged robots that controls both position and contact force without relying on external force sensors. The policy also enhances imitation learning by generating force-aware data, yielding a 39.5% improvement in success rates on contact-rich manipulation tasks with quadrupedal and humanoid robots.
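The summary does not reproduce the policy's formulation, so as a rough illustration of commanding position and contact force through a single control law, here is a generic 1-DoF task-space impedance/force sketch. All gains, names, and the single-axis setup are illustrative assumptions, not the authors' implementation; `f_meas` stands in for an estimated (not directly sensed) contact force.

```python
import numpy as np

def impedance_force_control(x, x_dot, x_des, f_des, f_meas, kp=200.0, kd=20.0, kf=0.5):
    """Generic 1-DoF task-space impedance law (illustrative, not the paper's policy).

    Tracks a desired position x_des while biasing the commanded force toward
    f_des; f_meas stands in for an *estimated* contact force, since the setting
    above assumes no external force sensor.
    """
    position_term = kp * (x_des - x) + kd * (0.0 - x_dot)  # spring-damper toward the target pose
    force_term = f_des + kf * (f_des - f_meas)             # feedforward force plus correction
    return position_term + force_term                      # commanded task-space force

# Toy usage: end-effector slightly off target and pressing too lightly.
u = impedance_force_control(x=0.10, x_dot=0.0, x_des=0.12, f_des=5.0, f_meas=3.0)
print(f"commanded task-space force: {u:.2f} N")
```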
Researchers at BIGAI and Peking University developed VideoAgent, a memory-augmented multimodal agent for long-form video understanding. VideoAgent utilizes a unified memory system, including temporal and object memories, to allow an LLM to perform strong spatial-temporal reasoning without end-to-end training. The model achieved 60.2% accuracy on the EgoSchema full test set, closely approaching Gemini 1.5 Pro's performance, and consistently outperformed open-source baselines across multiple challenging video understanding benchmarks.
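VideoAgent's exact memory schema is not given in this summary; the sketch below only illustrates the general pattern of a unified memory with temporal (per-segment) and object (track-level) entries that an LLM can query via embedding retrieval. All class and field names are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TemporalEntry:            # one video segment: caption plus a feature used for retrieval
    start_s: float
    end_s: float
    caption: str
    embedding: np.ndarray

@dataclass
class ObjectEntry:              # one tracked object: category and where/when it appears
    track_id: int
    category: str
    segments: list = field(default_factory=list)

@dataclass
class UnifiedMemory:
    temporal: list = field(default_factory=list)
    objects: list = field(default_factory=list)

    def retrieve(self, query_emb: np.ndarray, k: int = 3):
        """Return the k temporal entries most similar to the query embedding."""
        sims = [float(query_emb @ e.embedding /
                      (np.linalg.norm(query_emb) * np.linalg.norm(e.embedding) + 1e-8))
                for e in self.temporal]
        order = np.argsort(sims)[::-1][:k]
        return [self.temporal[i] for i in order]
```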
Researchers at the Beijing Institute for General Artificial Intelligence (BIGAI) introduce LEO, an embodied multi-modal generalist agent that perceives, grounds, reasons, plans, and acts in 3D environments. It achieves performance comparable to or exceeding task-specific models across 3D vision-language understanding, scene-grounded dialogue, planning, robotic manipulation, and object navigation tasks.
This work introduces a framework for open-vocabulary 3D scene understanding by integrating 2D semantic knowledge into 3D Gaussian Splatting (3DGS). It achieves this by projecting features from pre-trained 2D vision-language models onto 3D Gaussians, complemented by a 3D semantic network, enabling real-time language-driven queries for segmentation, localization, and editing in 3D scenes.
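To make the "language-driven queries" concrete, the snippet below shows one simple way such a query can work once each 3D Gaussian carries a distilled language feature: cosine similarity against a text embedding yields a per-Gaussian relevancy score and a selection mask. Dimensions and the threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def language_query_mask(gaussian_feats: np.ndarray,     # (N, D) per-Gaussian language features
                        text_emb: np.ndarray,            # (D,) embedding of the query phrase
                        threshold: float = 0.25):
    """Cosine-similarity relevancy between each 3D Gaussian and a text query."""
    g = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    relevancy = g @ t                        # (N,) similarity per Gaussian
    return relevancy, relevancy > threshold  # soft scores and a hard selection mask

# Toy usage with random vectors standing in for distilled 2D VLM features.
rng = np.random.default_rng(0)
scores, mask = language_query_mask(rng.normal(size=(1000, 512)), rng.normal(size=512))
print(mask.sum(), "Gaussians selected")
```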
An agent for open-world environments, JARVIS-1, leverages a memory-augmented multimodal language model to achieve robust multi-task performance and continuous self-improvement in Minecraft. It demonstrates up to 5x higher reliability on complex tasks like `ObtainDiamondPickaxe` compared to prior state-of-the-art models.
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at this https URL.
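As a concrete instance of the survey's addition-based category, the sketch below wraps a frozen linear layer with a trainable low-rank adapter in the style of LoRA. The rank, scaling, and layer sizes are illustrative assumptions, not values from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (addition-based PEFT)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # only the adapter is trained
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")       # far fewer than the 768*768 frozen base weights
```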
Researchers from BIGAI introduce SceneVerse, a million-scale 3D vision-language dataset, alongside GPS, a unified multi-level contrastive pre-training framework for grounded scene understanding. This work achieves state-of-the-art results across 3D visual grounding and question answering tasks while demonstrating strong zero-shot transfer capabilities.
Researchers from the NLCo Lab at BIGAI and collaborators developed VideoLLaMB, an efficient framework for long streaming video understanding that leverages semantic segmentation and recurrent memory with retrieval to process extended video sequences. The model achieves superior performance across several VideoQA benchmarks, including the new Needle in a Video Haystack (NIAVH) benchmark for fine-grained retrieval, while demonstrating linear GPU memory scaling on a single A100 GPU.
SQA3D introduces a benchmark dataset and task designed to evaluate the embodied scene understanding capabilities of AI agents in 3D environments. It requires agents to infer a situation from text and then perform situated reasoning from that egocentric viewpoint, revealing a substantial performance gap between current models and human capabilities.
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
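The closed-loop structure described in the abstract can be summarized as a perceive-plan-execute-verify cycle. The pseudocode-style sketch below is only an illustration of that cycle; `robot` and `vlm` are placeholder interfaces, not COME-robot's actual API, and the real system prompts GPT-4V with multi-level scene observations and traces failures back to the responsible module.

```python
def closed_loop_episode(instruction, robot, vlm, max_rounds=10):
    """Illustrative perceive-plan-execute-verify loop in the spirit of COME-robot."""
    history = []
    for _ in range(max_rounds):
        obs = robot.observe()                          # images plus 3D / scene-graph context
        plan = vlm.propose_action(instruction, obs, history)
        if plan.done:                                  # the VLM judges the task complete
            return True
        result = robot.execute(plan.action)            # grasp, navigate, place, ...
        ok, diagnosis = vlm.verify(instruction, robot.observe(), plan.action, result)
        history.append((plan.action, result, diagnosis))
        if not ok:
            robot.recover(diagnosis)                   # e.g., re-grasp or re-explore
    return False
```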
GROOT is an embodied AI agent that learns to follow open-ended instructions by observing raw gameplay videos in Minecraft. It achieved a 150-point Elo rating advantage (70% win rate) over state-of-the-art baselines on diverse Minecraft tasks, also demonstrating an ability to compose skills from concatenated video instructions.
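The quoted pairing of a 150-point Elo advantage with a ~70% win rate is consistent with the standard Elo expected-score formula, as this quick check shows.

```python
# Standard Elo expected score for a 150-point rating advantage.
expected_win_rate = 1.0 / (1.0 + 10 ** (-150 / 400))
print(f"{expected_win_rate:.3f}")   # ~0.703, i.e. roughly a 70% win rate
```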
PHYSCENE, developed by researchers at BIGAI, generates physically interactable 3D scenes for embodied AI by incorporating physical common sense into a conditional diffusion model. The method significantly reduces object collision rates and enhances scene reachability and navigability, outperforming prior state-of-the-art methods on both traditional and physical plausibility metrics.
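PHYSCENE's exact physical plausibility metrics are not spelled out in this summary; as one simple illustration of how an object collision rate can be measured for a generated scene, the sketch below checks axis-aligned bounding-box overlaps. The box representation and the metric definition are assumptions for illustration only.

```python
import numpy as np

def aabb_overlap(box_a, box_b):
    """Boxes as (min_xyz, max_xyz); True if the axis-aligned boxes intersect."""
    return bool(np.all(box_a[0] < box_b[1]) and np.all(box_b[0] < box_a[1]))

def collision_rate(boxes):
    """Fraction of objects whose bounding box intersects at least one other object."""
    n = len(boxes)
    colliding = set()
    for i in range(n):
        for j in range(i + 1, n):
            if aabb_overlap(boxes[i], boxes[j]):
                colliding.update((i, j))
    return len(colliding) / max(n, 1)

boxes = [(np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])),
         (np.array([0.5, 0.5, 0.0]), np.array([1.5, 1.5, 1.0])),   # overlaps the first box
         (np.array([3.0, 3.0, 0.0]), np.array([4.0, 4.0, 1.0]))]
print(collision_rate(boxes))   # 2 of 3 objects collide -> ~0.67
```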
CLOVA is a closed-loop framework enabling visual assistants to continually learn and update their underlying visual tools and LLM reasoning from failures and human feedback. This system achieves significant performance gains, including over 20% improvement in image editing, across diverse visual tasks by incorporating novel prompt tuning for tools and multimodal reflection for precise error attribution.
The paper introduces In-Context Meta LoRA (ICM-LoRA), a framework that generates task-specific LoRA parameters for large language and vision-language models from a compact conditional variational autoencoder (CVAE). This method reduces storage requirements to 1% of those needed for the original LoRA weights while achieving performance comparable to or exceeding direct LoRA fine-tuning across diverse tasks.
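ICM-LoRA's architecture is not specified in this summary; the sketch below only illustrates the core idea of a CVAE-style decoder mapping a task-conditioned latent code to flattened LoRA A/B matrices for a single target layer. Layer sizes, rank, and names are assumptions.

```python
import torch
import torch.nn as nn

class LoRAGeneratorDecoder(nn.Module):
    """Decoder mapping (latent code, task embedding) to LoRA A/B matrices (illustrative)."""
    def __init__(self, latent_dim=64, task_dim=128, d_model=768, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        out_dim = 2 * rank * d_model                 # A (rank x d) and B (d x rank), flattened
        self.net = nn.Sequential(
            nn.Linear(latent_dim + task_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, z, task_emb):
        flat = self.net(torch.cat([z, task_emb], dim=-1))
        A, B = flat.split(self.rank * self.d_model, dim=-1)
        return A.view(-1, self.rank, self.d_model), B.view(-1, self.d_model, self.rank)

dec = LoRAGeneratorDecoder()
A, B = dec(torch.randn(1, 64), torch.randn(1, 128))
print(A.shape, B.shape)   # torch.Size([1, 8, 768]) torch.Size([1, 768, 8])
```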
Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?) to provide diagnostic analyses on spatial, temporal, and causal understandings of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning.
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. We evaluate our agent in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
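The abstract names Conservative Q-Learning as the offline RL component; the snippet below shows the standard CQL penalty for a discrete action space as one piece of such a recipe. It is a minimal sketch under assumed shapes, and it omits the recurrent encoder, mask re-targeting, and data-collection details described above.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_values, actions, td_target, alpha=1.0):
    """Conservative Q-learning objective for discrete actions (illustrative).

    q_values:  (B, num_actions) Q estimates for the current observation
    actions:   (B,) actions actually taken in the offline demonstrations
    td_target: (B,) bootstrapped targets r + gamma * max_a' Q_target(s', a')
    """
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    bellman = F.mse_loss(q_taken, td_target)
    # Conservative term: push down Q on all actions, push up Q on dataset actions.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()
    return bellman + alpha * conservative

q = torch.randn(4, 6, requires_grad=True)          # toy batch: 4 observations, 6 actions
loss = cql_loss(q, torch.tensor([0, 2, 5, 1]), torch.zeros(4))
loss.backward()
```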
This research introduces a standardized framework, the Machine Personality Inventory (MPI), for quantitatively assessing personality in Large Language Models (LLMs) and proposes PERSONALITY PROMPTING (P²), a method for inducing specific Big Five personality traits. Instruction fine-tuned LLMs like GPT-3.5 and Alpaca 7B demonstrated human-level internal consistency in personality traits, and P² successfully induced targeted personality changes, validated by both MPI scores and human-rated vignette tests.
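The MPI's actual item set and scoring key are not reproduced here; the snippet below only shows the generic Likert-scale scheme that Big Five inventories of this kind use, averaging item responses per trait and reverse-scoring negatively keyed items. The items and responses are made up for illustration.

```python
# Generic Big Five Likert scoring (illustrative; not the MPI's actual item key).
# Each item: (trait, keyed_positive) paired with a 1-5 response from the model.
items = [("Openness", True), ("Openness", False), ("Extraversion", True)]
responses = [5, 2, 4]

scores = {}
for (trait, positive), r in zip(items, responses):
    value = r if positive else 6 - r        # reverse-score negatively keyed items
    scores.setdefault(trait, []).append(value)

trait_scores = {t: sum(v) / len(v) for t, v in scores.items()}
print(trait_scores)                          # {'Openness': 4.5, 'Extraversion': 4.0}
```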
Knowledge graph (KG) reasoning is an important problem for knowledge graphs. In this paper, we propose a novel and principled framework called RulE (stands for Rule Embedding) to effectively leverage logical rules to enhance KG reasoning. Unlike knowledge graph embedding (KGE) methods, RulE learns rule embeddings from existing triplets and first-order rules by jointly representing entities, relations and logical rules in a unified embedding space. Based on the learned rule embeddings, a confidence score can be calculated for each rule, reflecting its consistency with the observed triplets. This allows us to perform logical rule inference in a soft way, thus alleviating the brittleness of logic. On the other hand, RulE injects prior logical rule information into the embedding space, enriching and regularizing the entity/relation embeddings. This makes KGE alone perform better too. RulE is conceptually simple and empirically effective. We conduct extensive experiments to verify each component of RulE. Results on multiple benchmarks reveal that our model outperforms the majority of existing embedding-based and rule-based approaches.
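The snippet below is not RulE's actual scoring function; it only illustrates the general idea of softening logical inference by weighting rule-derived conclusions with a learned per-rule confidence and blending that with a KGE triplet score (a TransE-style distance is used purely as a stand-in, and the confidences are given numbers rather than learned from rule embeddings).

```python
import numpy as np

def transe_score(h, r, t):
    """Stand-in KGE triplet score (higher is more plausible): negative TransE distance."""
    return -float(np.linalg.norm(h + r - t))

def soft_rule_inference(kge_score, rule_confidences, fired_rules, weight=0.5):
    """Blend the embedding score with the confidence of rules that derived the triplet.

    Illustrative only: RulE learns rule embeddings jointly with entities/relations and
    derives confidences from them; here the confidences are just numbers in [0, 1].
    """
    rule_support = max((rule_confidences[r] for r in fired_rules), default=0.0)
    return (1 - weight) * kge_score + weight * rule_support

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))                       # toy entity/relation embeddings
print(soft_rule_inference(transe_score(h, r, t), {"r1": 0.9, "r2": 0.4}, ["r1"]))
```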
M2Diffuser introduces a diffusion-based framework for generating coordinated whole-body robot trajectories for mobile manipulation tasks. The approach leverages robot-centric 3D scans and guided optimization to produce physically valid and precise motions, demonstrating effective performance in tasks like object grasping and placement, including successful sim-to-real transfer.
This research introduces HUMANISE, a large-scale, semantic-rich dataset for human-scene interaction, and defines the task of language-conditioned human motion generation in 3D scenes. The proposed cVAE-based model generates diverse and plausible 3D human motions guided by natural language instructions and 3D scene context, demonstrating improved performance on downstream motion synthesis tasks.