Beijing Institute for General Artificial Intelligence (BIGAI)
Researchers at BIGAI and Unitree Robotics developed a unified policy for legged robots that controls both position and contact force without relying on external force sensors. The policy also enhances imitation learning by generating force-aware data, yielding a 39.5% improvement in success rates on contact-rich manipulation tasks with quadrupedal and humanoid robots.
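The summary does not reproduce the policy's formulation, so as a rough illustration of commanding position and contact force through a single control law, here is a generic 1-DoF task-space impedance/force sketch. All gains, names, and the single-axis setup are illustrative assumptions, not the authors' implementation; `f_meas` stands in for an estimated (not directly sensed) contact force.

```python
import numpy as np

def impedance_force_control(x, x_dot, x_des, f_des, f_meas, kp=200.0, kd=20.0, kf=0.5):
    """Generic 1-DoF task-space impedance law (illustrative, not the paper's policy).

    Tracks a desired position x_des while biasing the commanded force toward
    f_des; f_meas stands in for an *estimated* contact force, since the setting
    above assumes no external force sensor.
    """
    position_term = kp * (x_des - x) + kd * (0.0 - x_dot)  # spring-damper toward the target pose
    force_term = f_des + kf * (f_des - f_meas)             # feedforward force plus correction
    return position_term + force_term                      # commanded task-space force

# Toy usage: end-effector slightly off target and pressing too lightly.
u = impedance_force_control(x=0.10, x_dot=0.0, x_des=0.12, f_des=5.0, f_meas=3.0)
print(f"commanded task-space force: {u:.2f} N")
```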
Researchers at BIGAI and Peking University developed VideoAgent, a memory-augmented multimodal agent for long-form video understanding. VideoAgent utilizes a unified memory system, including temporal and object memories, to allow an LLM to perform strong spatial-temporal reasoning without end-to-end training. The model achieved 60.2% accuracy on the EgoSchema full test set, closely approaching Gemini 1.5 Pro's performance, and consistently outperformed open-source baselines across multiple challenging video understanding benchmarks.
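VideoAgent's exact memory schema is not given in this summary; the sketch below only illustrates the general pattern of a unified memory with temporal (per-segment) and object (track-level) entries that an LLM can query via embedding retrieval. All class and field names are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TemporalEntry:            # one video segment: caption plus a feature used for retrieval
    start_s: float
    end_s: float
    caption: str
    embedding: np.ndarray

@dataclass
class ObjectEntry:              # one tracked object: category and where/when it appears
    track_id: int
    category: str
    segments: list = field(default_factory=list)

@dataclass
class UnifiedMemory:
    temporal: list = field(default_factory=list)
    objects: list = field(default_factory=list)

    def retrieve(self, query_emb: np.ndarray, k: int = 3):
        """Return the k temporal entries most similar to the query embedding."""
        sims = [float(query_emb @ e.embedding /
                      (np.linalg.norm(query_emb) * np.linalg.norm(e.embedding) + 1e-8))
                for e in self.temporal]
        order = np.argsort(sims)[::-1][:k]
        return [self.temporal[i] for i in order]
```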
Researchers at the Beijing Institute for General Artificial Intelligence (BIGAI) introduce LEO, an embodied multi-modal generalist agent that perceives, grounds, reasons, plans, and acts in 3D environments. It achieves performance comparable to or exceeding task-specific models across 3D vision-language understanding, scene-grounded dialogue, planning, robotic manipulation, and object navigation tasks.
This work introduces a framework for open-vocabulary 3D scene understanding by integrating 2D semantic knowledge into 3D Gaussian Splatting (3DGS). It achieves this by projecting features from pre-trained 2D vision-language models onto 3D Gaussians, complemented by a 3D semantic network, enabling real-time language-driven queries for segmentation, localization, and editing in 3D scenes.
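To make the "language-driven queries" concrete, the snippet below shows one simple way such a query can work once each 3D Gaussian carries a distilled language feature: cosine similarity against a text embedding yields a per-Gaussian relevancy score and a selection mask. Dimensions and the threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def language_query_mask(gaussian_feats: np.ndarray,     # (N, D) per-Gaussian language features
                        text_emb: np.ndarray,            # (D,) embedding of the query phrase
                        threshold: float = 0.25):
    """Cosine-similarity relevancy between each 3D Gaussian and a text query."""
    g = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    relevancy = g @ t                        # (N,) similarity per Gaussian
    return relevancy, relevancy > threshold  # soft scores and a hard selection mask

# Toy usage with random vectors standing in for distilled 2D VLM features.
rng = np.random.default_rng(0)
scores, mask = language_query_mask(rng.normal(size=(1000, 512)), rng.normal(size=512))
print(mask.sum(), "Gaussians selected")
```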
An agent for open-world environments, JARVIS-1, leverages a memory-augmented multimodal language model to achieve robust multi-task performance and continuous self-improvement in Minecraft. It demonstrates up to 5x higher reliability on complex tasks like `ObtainDiamondPickaxe` compared to prior state-of-the-art models.
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at this https URL.
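As a concrete instance of the survey's addition-based category, the sketch below wraps a frozen linear layer with a trainable low-rank adapter in the style of LoRA. The rank, scaling, and layer sizes are illustrative assumptions, not values from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (addition-based PEFT)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # only the adapter is trained
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")       # far fewer than the 768*768 frozen base weights
```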
Researchers from BIGAI introduce SceneVerse, a million-scale 3D vision-language dataset, alongside GPS, a unified multi-level contrastive pre-training framework for grounded scene understanding. This work achieves state-of-the-art results across 3D visual grounding and question answering tasks while demonstrating strong zero-shot transfer capabilities.
Researchers from the NLCo Lab at BIGAI and collaborators developed VideoLLaMB, an efficient framework for long streaming video understanding that leverages semantic segmentation and recurrent memory with retrieval to process extended video sequences. The model achieves superior performance across several VideoQA benchmarks, including the new Needle in a Video Haystack (NIAVH) benchmark for fine-grained retrieval, while demonstrating linear GPU memory scaling on a single A100 GPU.
SQA3D introduces a benchmark dataset and task designed to evaluate the embodied scene understanding capabilities of AI agents in 3D environments. It requires agents to infer a situation from text and then perform situated reasoning from that egocentric viewpoint, revealing a substantial performance gap between current models and human capabilities.
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
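The closed-loop structure described in the abstract can be summarized as a perceive-plan-execute-verify cycle. The pseudocode-style sketch below is only an illustration of that cycle; `robot` and `vlm` are placeholder interfaces, not COME-robot's actual API, and the real system prompts GPT-4V with multi-level scene observations and traces failures back to the responsible module.

```python
def closed_loop_episode(instruction, robot, vlm, max_rounds=10):
    """Illustrative perceive-plan-execute-verify loop in the spirit of COME-robot."""
    history = []
    for _ in range(max_rounds):
        obs = robot.observe()                          # images plus 3D / scene-graph context
        plan = vlm.propose_action(instruction, obs, history)
        if plan.done:                                  # the VLM judges the task complete
            return True
        result = robot.execute(plan.action)            # grasp, navigate, place, ...
        ok, diagnosis = vlm.verify(instruction, robot.observe(), plan.action, result)
        history.append((plan.action, result, diagnosis))
        if not ok:
            robot.recover(diagnosis)                   # e.g., re-grasp or re-explore
    return False
```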
GROOT is an embodied AI agent that learns to follow open-ended instructions by observing raw gameplay videos in Minecraft. It achieved a 150-point Elo rating advantage (70% win rate) over state-of-the-art baselines on diverse Minecraft tasks, also demonstrating an ability to compose skills from concatenated video instructions.
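The quoted pairing of a 150-point Elo advantage with a ~70% win rate is consistent with the standard Elo expected-score formula, as this quick check shows.

```python
# Standard Elo expected score for a 150-point rating advantage.
expected_win_rate = 1.0 / (1.0 + 10 ** (-150 / 400))
print(f"{expected_win_rate:.3f}")   # ~0.703, i.e. roughly a 70% win rate
```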
PHYSCENE, developed by researchers at BIGAI, generates physically interactable 3D scenes for embodied AI by incorporating physical common sense into a conditional diffusion model. The method significantly reduces object collision rates and enhances scene reachability and navigability, outperforming prior state-of-the-art methods on both traditional and physical plausibility metrics.
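PHYSCENE's exact physical plausibility metrics are not spelled out in this summary; as one simple illustration of how an object collision rate can be measured for a generated scene, the sketch below checks axis-aligned bounding-box overlaps. The box representation and the metric definition are assumptions for illustration only.

```python
import numpy as np

def aabb_overlap(box_a, box_b):
    """Boxes as (min_xyz, max_xyz); True if the axis-aligned boxes intersect."""
    return bool(np.all(box_a[0] < box_b[1]) and np.all(box_b[0] < box_a[1]))

def collision_rate(boxes):
    """Fraction of objects whose bounding box intersects at least one other object."""
    n = len(boxes)
    colliding = set()
    for i in range(n):
        for j in range(i + 1, n):
            if aabb_overlap(boxes[i], boxes[j]):
                colliding.update((i, j))
    return len(colliding) / max(n, 1)

boxes = [(np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])),
         (np.array([0.5, 0.5, 0.0]), np.array([1.5, 1.5, 1.0])),   # overlaps the first box
         (np.array([3.0, 3.0, 0.0]), np.array([4.0, 4.0, 1.0]))]
print(collision_rate(boxes))   # 2 of 3 objects collide -> ~0.67
```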
CLOVA is a closed-loop framework enabling visual assistants to continually learn and update their underlying visual tools and LLM reasoning from failures and human feedback. This system achieves significant performance gains, including over 20% improvement in image editing, across diverse visual tasks by incorporating novel prompt tuning for tools and multimodal reflection for precise error attribution.
The paper introduces In-Context Meta LoRA (ICM-LoRA), a framework that generates task-specific LoRA parameters for large language and vision-language models from a compact conditional variational autoencoder (CVAE). This method reduces storage requirements to 1% of those needed for the original LoRA weights while achieving performance comparable to or exceeding direct LoRA fine-tuning across diverse tasks.
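ICM-LoRA's architecture is not specified in this summary; the sketch below only illustrates the core idea of a CVAE-style decoder mapping a task-conditioned latent code to flattened LoRA A/B matrices for a single target layer. Layer sizes, rank, and names are assumptions.

```python
import torch
import torch.nn as nn

class LoRAGeneratorDecoder(nn.Module):
    """Decoder mapping (latent code, task embedding) to LoRA A/B matrices (illustrative)."""
    def __init__(self, latent_dim=64, task_dim=128, d_model=768, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        out_dim = 2 * rank * d_model                 # A (rank x d) and B (d x rank), flattened
        self.net = nn.Sequential(
            nn.Linear(latent_dim + task_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, z, task_emb):
        flat = self.net(torch.cat([z, task_emb], dim=-1))
        A, B = flat.split(self.rank * self.d_model, dim=-1)
        return A.view(-1, self.rank, self.d_model), B.view(-1, self.d_model, self.rank)

dec = LoRAGeneratorDecoder()
A, B = dec(torch.randn(1, 64), torch.randn(1, 128))
print(A.shape, B.shape)   # torch.Size([1, 8, 768]) torch.Size([1, 768, 8])
```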
Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?) to provide diagnostic analyses on spatial, temporal, and causal understandings of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning.
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. We evaluate our agent in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
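The abstract names Conservative Q-Learning as the offline RL component; the snippet below shows the standard CQL penalty for a discrete action space as one piece of such a recipe. It is a minimal sketch under assumed shapes, and it omits the recurrent encoder, mask re-targeting, and data-collection details described above.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_values, actions, td_target, alpha=1.0):
    """Conservative Q-learning objective for discrete actions (illustrative).

    q_values:  (B, num_actions) Q estimates for the current observation
    actions:   (B,) actions actually taken in the offline demonstrations
    td_target: (B,) bootstrapped targets r + gamma * max_a' Q_target(s', a')
    """
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    bellman = F.mse_loss(q_taken, td_target)
    # Conservative term: push down Q on all actions, push up Q on dataset actions.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()
    return bellman + alpha * conservative

q = torch.randn(4, 6, requires_grad=True)          # toy batch: 4 observations, 6 actions
loss = cql_loss(q, torch.tensor([0, 2, 5, 1]), torch.zeros(4))
loss.backward()
```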
This research introduces a standardized framework, the Machine Personality Inventory (MPI), for quantitatively assessing personality in Large Language Models (LLMs) and proposes PERSONALITY PROMPTING (P²), a method for inducing specific Big Five personality traits. Instruction fine-tuned LLMs like GPT-3.5 and Alpaca 7B demonstrated human-level internal consistency in personality traits, and P² successfully induced targeted personality changes, validated by both MPI scores and human-rated vignette tests.
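The MPI's actual item set and scoring key are not reproduced here; the snippet below only shows the generic Likert-scale scheme that Big Five inventories of this kind use, averaging item responses per trait and reverse-scoring negatively keyed items. The items and responses are made up for illustration.

```python
# Generic Big Five Likert scoring (illustrative; not the MPI's actual item key).
# Each item: (trait, keyed_positive) paired with a 1-5 response from the model.
items = [("Openness", True), ("Openness", False), ("Extraversion", True)]
responses = [5, 2, 4]

scores = {}
for (trait, positive), r in zip(items, responses):
    value = r if positive else 6 - r        # reverse-score negatively keyed items
    scores.setdefault(trait, []).append(value)

trait_scores = {t: sum(v) / len(v) for t, v in scores.items()}
print(trait_scores)                          # {'Openness': 4.5, 'Extraversion': 4.0}
```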
Knowledge graph (KG) reasoning is an important problem for knowledge graphs. In this paper, we propose a novel and principled framework called RulE (stands for Rule Embedding) to effectively leverage logical rules to enhance KG reasoning. Unlike knowledge graph embedding (KGE) methods, RulE learns rule embeddings from existing triplets and first-order rules by jointly representing entities, relations and logical rules in a unified embedding space. Based on the learned rule embeddings, a confidence score can be calculated for each rule, reflecting its consistency with the observed triplets. This allows us to perform logical rule inference in a soft way, thus alleviating the brittleness of logic. On the other hand, RulE injects prior logical rule information into the embedding space, enriching and regularizing the entity/relation embeddings. This makes KGE alone perform better too. RulE is conceptually simple and empirically effective. We conduct extensive experiments to verify each component of RulE. Results on multiple benchmarks reveal that our model outperforms the majority of existing embedding-based and rule-based approaches.
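The snippet below is not RulE's actual scoring function; it only illustrates the general idea of softening logical inference by weighting rule-derived conclusions with a learned per-rule confidence and blending that with a KGE triplet score (a TransE-style distance is used purely as a stand-in, and the confidences are given numbers rather than learned from rule embeddings).

```python
import numpy as np

def transe_score(h, r, t):
    """Stand-in KGE triplet score (higher is more plausible): negative TransE distance."""
    return -float(np.linalg.norm(h + r - t))

def soft_rule_inference(kge_score, rule_confidences, fired_rules, weight=0.5):
    """Blend the embedding score with the confidence of rules that derived the triplet.

    Illustrative only: RulE learns rule embeddings jointly with entities/relations and
    derives confidences from them; here the confidences are just numbers in [0, 1].
    """
    rule_support = max((rule_confidences[r] for r in fired_rules), default=0.0)
    return (1 - weight) * kge_score + weight * rule_support

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))                       # toy entity/relation embeddings
print(soft_rule_inference(transe_score(h, r, t), {"r1": 0.9, "r2": 0.4}, ["r1"]))
```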
M2Diffuser introduces a diffusion-based framework for generating coordinated whole-body robot trajectories for mobile manipulation tasks. The approach leverages robot-centric 3D scans and guided optimization to produce physically valid and precise motions, demonstrating effective performance in tasks like object grasping and placement, including successful sim-to-real transfer.
This research introduces HUMANISE, a large-scale, semantic-rich dataset for human-scene interaction, and defines the task of language-conditioned human motion generation in 3D scenes. The proposed cVAE-based model generates diverse and plausible 3D human motions guided by natural language instructions and 3D scene context, demonstrating improved performance on downstream motion synthesis tasks.