Beijing Academy of Artificial Intelligence
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
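As an illustration of the event-driven spiking behavior described above, the following minimal sketch shows how continuous activations can be converted into sparse binary spikes by a leaky integrate-and-fire unit with an adaptive threshold. The function name, constants, and coding scheme are assumptions for illustration only; they do not reproduce SpikingBrain's actual spike coding framework.

```python
# Illustrative sketch only: a leaky integrate-and-fire style neuron with an
# adaptive threshold, showing how continuous activations can be turned into
# sparse, event-driven spikes. All names and constants are illustrative
# assumptions, not SpikingBrain's actual scheme.
import numpy as np

def adaptive_spike_encode(x, steps=4, decay=0.5, theta0=1.0, theta_gain=0.5):
    """Encode a vector of activations into binary spike trains over `steps` steps."""
    v = np.zeros_like(x)             # membrane potential
    theta = np.full_like(x, theta0)  # per-unit adaptive threshold
    spikes = []
    for _ in range(steps):
        v = decay * v + x                   # integrate input with leak
        fired = (v >= theta).astype(x.dtype)
        spikes.append(fired)
        v = v - fired * theta               # soft reset on spike
        theta = theta + theta_gain * fired  # threshold rises after a spike (adaptation)
    return np.stack(spikes)                 # shape: (steps, len(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = np.abs(rng.normal(size=1024)).astype(np.float32)
    s = adaptive_spike_encode(acts)
    print("spike sparsity:", 1.0 - s.mean())  # fraction of inactive (zero) events
```

The sparsity printed at the end is what enables low-power, event-driven operation: only the nonzero spikes trigger computation.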
URSA presents a uniform discrete diffusion framework that incorporates a metric probability path for video generation, enabling iterative global refinement in discrete token space. This framework achieves performance competitive with state-of-the-art continuous diffusion models across text-to-video, image-to-video, and text-to-image benchmarks, while enhancing scalability and multi-task capabilities.
BAAI's OmniGen2 introduces an open-source unified multimodal generative model featuring a decoupled architecture, novel data pipelines, and a new benchmark for complex tasks. It achieves competitive performance on text-to-image generation (0.86 on GenEval) and image editing (a state-of-the-art 3.44 among open-source models on ImgEdit-Bench), and establishes a strong 7.18 baseline on the OmniContext benchmark for in-context generation.
RoboRefer introduces a 3D-aware Vision-Language Model that achieves precise spatial understanding and generalized multi-step spatial reasoning for robotics through a dedicated depth encoder and a sequential SFT-RFT training strategy. It outperforms state-of-the-art models on spatial referring benchmarks, improving average accuracy by 17.4% on RefSpatial-Bench, and successfully executes long-horizon tasks across diverse real-world robots.
Vision Mamba (Vim) introduces the first purely State Space Model-based generic vision backbone, achieving accuracy competitive with Vision Transformers while offering significantly better computational and memory efficiency on high-resolution images. The architecture adapts the Mamba model with a bidirectional processing scheme and hardware-aware optimizations.
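To make the bidirectional processing idea concrete, here is a toy sketch that runs a simple linear recurrent scan over the flattened patch sequence in both directions and sums the results. The recurrence is a stand-in, not Mamba's actual selective SSM; all shapes and parameters are illustrative assumptions.

```python
# Toy sketch of bidirectional sequence processing over image patch tokens.
# The recurrence below is a simplified stand-in for Mamba's selective SSM.
import numpy as np

def linear_scan(tokens, decay=0.9):
    """Causal recurrent scan: h_t = decay * h_{t-1} + x_t."""
    h = np.zeros(tokens.shape[1])
    out = np.empty_like(tokens)
    for t, x in enumerate(tokens):
        h = decay * h + x
        out[t] = h
    return out

def bidirectional_block(patch_tokens):
    fwd = linear_scan(patch_tokens)              # left-to-right context
    bwd = linear_scan(patch_tokens[::-1])[::-1]  # right-to-left context
    return fwd + bwd                             # merge both directions

if __name__ == "__main__":
    tokens = np.random.default_rng(0).normal(size=(196, 64))  # e.g. 14x14 patches, dim 64
    print(bidirectional_block(tokens).shape)  # (196, 64)
```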
RoboBrain 2.0, an embodied vision-language foundation model from BAAI, integrates perception, reasoning, and planning to empower intelligent interaction in physical environments. It demonstrates state-of-the-art performance across numerous spatial and temporal reasoning benchmarks, including object placement prediction and multi-robot planning, establishing a new baseline for embodied AI capabilities.
Reason-RFT introduces a two-stage reinforcement fine-tuning framework to enhance the visual reasoning and generalization capabilities of Vision-Language Models. The method achieves state-of-the-art or highly competitive performance across various visual reasoning tasks, particularly demonstrating strong generalization under domain shifts while maintaining data efficiency.
MemoRAG is a retrieval-augmented generation framework that enhances long-context processing by incorporating a global memory module inspired by human cognition. The system processes entire long documents to form a high-level understanding, which then generates "answer clues" to guide precise retrieval, consistently outperforming existing RAG methods and long-context LLMs while efficiently managing GPU memory.
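The memory-then-retrieve flow described above can be sketched as a small pipeline: a memory model reads the whole document to produce answer clues, the clues drive retrieval, and a generator answers from the retrieved evidence. The callables below are hypothetical placeholders, not MemoRAG's actual API.

```python
# Hedged sketch of a MemoRAG-style flow: global memory -> answer clues ->
# clue-guided retrieval -> grounded generation. Placeholders, not the real API.
from typing import Callable, List

def memorag_answer(
    document: str,
    question: str,
    memory_model: Callable[[str, str], List[str]],  # (document, question) -> clue strings
    retriever: Callable[[str], List[str]],           # clue -> relevant passages
    generator: Callable[[str, List[str]], str],      # (question, evidence) -> answer
) -> str:
    clues = memory_model(document, question)         # global understanding -> answer clues
    evidence = []
    for clue in clues:                               # clue-guided precise retrieval
        evidence.extend(retriever(clue))
    return generator(question, evidence)             # grounded final answer

if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end.
    doc = "Alice founded Acme in 1999. Acme builds telescopes."
    ans = memorag_answer(
        doc,
        "What does Acme build?",
        memory_model=lambda d, q: ["Acme builds"],
        retriever=lambda clue: [s for s in doc.split(". ") if clue.split()[-1] in s],
        generator=lambda q, ev: ev[0] if ev else "unknown",
    )
    print(ans)
```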
Uni-NaVid presents a unified Vision-Language-Action (VLA) model capable of addressing four distinct embodied navigation tasks by processing ego-centric video and natural language instructions. The model employs a novel online visual token merging mechanism to achieve a 5 Hz inference speed, demonstrating state-of-the-art performance in simulation and strong real-world generalization.
Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (this https URL).
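A minimal sketch of the iterative loop described in the abstract is given below: plan a sub-question, ground it to a time window, perceive that segment, and repeat until the planner is ready to answer. The callables are hypothetical placeholders rather than VideoExplorer's actual interfaces.

```python
# Sketch of an iterative "thinking with video" loop: plan -> ground -> perceive
# -> repeat until the planner can answer. Interfaces are illustrative only.
from typing import Callable, List, Optional, Tuple

def explore_video(
    question: str,
    plan: Callable[[str, List[str]], Optional[str]],      # (question, findings) -> next sub-question or None
    ground: Callable[[str], Tuple[float, float]],          # sub-question -> (start_s, end_s) time window
    perceive: Callable[[str, Tuple[float, float]], str],   # inspect that segment, return a finding
    answer: Callable[[str, List[str]], str],               # final answer from accumulated findings
    max_steps: int = 8,
) -> str:
    findings: List[str] = []
    for _ in range(max_steps):
        sub_q = plan(question, findings)
        if sub_q is None:                  # planner is confident enough to answer
            break
        window = ground(sub_q)             # temporal grounding of the sub-question
        findings.append(perceive(sub_q, window))
    return answer(question, findings)
```

The design point this sketch captures is that perception is allocated adaptively per sub-question rather than spent once on a fixed, downsampled context.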
GraspVLA, a grasping foundation model from Galbot and Peking University, demonstrates robust zero-shot sim-to-real transfer and open-vocabulary grasping by pre-training exclusively on SynGrasp-1B, a billion-scale synthetic action dataset, achieving around 90% success rates on diverse real-world object categories.
Researchers from Tsinghua University and Xiaomi EV developed DGGT, a feedforward, pose-free framework for 4D reconstruction of dynamic driving scenes. The method achieves 27.41 dB PSNR and 0.846 SSIM on the Waymo Open Dataset for novel view synthesis, outperforming prior work while processing scenes in 0.39 seconds and demonstrating strong zero-shot generalization across diverse datasets.
Researchers from NUS, HKUST, and EPFL propose a systematic black-box framework for evaluating how large language models express confidence. Their empirical study reveals pervasive overconfidence in LLMs, but shows that combining sampling with aggregation strategies can significantly improve failure prediction, for example, boosting AUROC from 54.8% to 92.7% in arithmetic reasoning tasks.
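One common way to realize the sampling-plus-aggregation strategy mentioned above is consistency-based confidence: sample several answers and use the agreement rate of the majority answer as a confidence signal for failure prediction. The sketch below assumes a generic stochastic LLM call and is not necessarily the exact aggregation used in the paper.

```python
# Sketch of sample-and-aggregate confidence: query the model several times and
# treat agreement among samples as confidence. `sample_answer` is a placeholder
# for any stochastic LLM call; the paper studies several aggregation strategies.
from collections import Counter
from typing import Callable, Tuple

def consistency_confidence(
    prompt: str,
    sample_answer: Callable[[str], str],  # one stochastic generation per call
    k: int = 10,
) -> Tuple[str, float]:
    answers = [sample_answer(prompt) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / k                # majority answer and its agreement rate

if __name__ == "__main__":
    import random
    fake_llm = lambda p: random.choice(["42", "42", "42", "41"])  # toy stand-in
    print(consistency_confidence("What is 6*7?", fake_llm))
```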
Researchers from Peking University, BAAI, and CAS present RoboBrain, an MLLM-based model designed to bridge the gap between high-level human instructions and concrete robotic actions. Leveraging the new ShareRobot dataset, RoboBrain integrates planning, affordance perception, and trajectory prediction, achieving improved performance on established robotic benchmarks.
Researchers introduce EditScore, a series of high-fidelity, open-source reward models, and a new benchmark, EditReward-Bench, to enable stable online Reinforcement Learning for instruction-guided image editing. These specialized reward models, ranging from 7B to 72B parameters, significantly enhance the performance of leading image editing models like OmniGen2, achieving up to a +0.40 increase in overall score on GEdit-Bench-EN.
OmniGen, developed by the Beijing Academy of Artificial Intelligence (BAAI), introduces a unified diffusion model capable of performing diverse image generation tasks like text-to-image, image editing, and subject-driven generation within a single framework. The model achieves competitive performance with 3.8 billion parameters, demonstrating parameter efficiency compared to much larger specialized models while streamlining complex multi-step workflows into end-to-end instructions.
Researchers from BAAI and several top Chinese universities introduce MLVU, a multi-task long video understanding benchmark designed to rigorously evaluate Multimodal Large Language Models (MLLMs) by utilizing videos from 3 minutes to over 2 hours and featuring 9 diverse tasks. The benchmark reveals that even leading MLLMs like GPT-4o achieve only moderate performance (54.5% average on multiple-choice tasks), indicating significant room for improvement in handling extended temporal reasoning.
JudgeLM, developed by researchers from Huazhong University of Science & Technology and BAAI, is a framework for fine-tuning open-source Large Language Models to serve as scalable and efficient evaluators. It achieves up to 89.32% agreement with GPT-4 and a 133.3x speedup in evaluation time, demonstrating better alignment with human judgments than its GPT-4 teacher on specific benchmarks.
Emu2, a 37-billion-parameter generative multimodal model developed by the Beijing Academy of Artificial Intelligence (BAAI), demonstrates robust in-context learning capabilities by predicting the next multimodal element in a sequence. The model achieves state-of-the-art performance in few-shot settings across a wide range of multimodal understanding and generation tasks, including visual question answering, referring expression comprehension, text-to-image generation, and subject-driven image editing.
RoboBench introduces a comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs) acting as embodied brains, assessing five cognitive dimensions through real-world robotic data and a novel MLLM-as-world-simulator for planning. The benchmark reveals that current MLLMs still lag behind human performance across all evaluated cognitive abilities, particularly in complex planning and execution failure analysis.