Perception Encoder introduces a family of vision models that achieve state-of-the-art performance across diverse vision and vision-language tasks, demonstrating that general, high-quality visual features can be extracted from the intermediate layers of a single, contrastively-trained network. It provides specific alignment tuning methods to make these features accessible for tasks ranging from zero-shot classification to dense spatial prediction and multimodal language understanding.
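To make the intermediate-feature idea concrete, here is a minimal sketch of tapping mid-depth activations from a vision transformer with forward hooks; the toy TinyViT module, the chosen layer index, and the pooling are illustrative assumptions, not the released Perception Encoder checkpoints or the paper's alignment-tuning procedure.

```python
import torch
import torch.nn as nn

# Toy stand-in for a contrastively trained vision encoder; the real
# Perception Encoder models and layer choice are not reproduced here.
class TinyViT(nn.Module):
    def __init__(self, dim=64, depth=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

model = TinyViT()
features = {}

def make_hook(idx):
    def _hook(module, inputs, output):
        features[idx] = output.detach()  # cache this block's token features
    return _hook

# Tap every block; the paper's point is that the most general features
# often live before the final contrastive projection, not at the output.
for idx, blk in enumerate(model.blocks):
    blk.register_forward_hook(make_hook(idx))

tokens = torch.randn(1, 197, 64)   # [batch, patches + CLS, dim]
_ = model(tokens)
mid_feats = features[3]            # hypothetical mid-depth "sweet spot" layer
print(mid_feats.shape)             # torch.Size([1, 197, 64])
```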
This paper systematically studies using powerful language models as automated judges for evaluating other LLMs, introducing the MT-bench and Chatbot Arena benchmarks for human-preference-aligned assessment. It demonstrates that GPT-4, when used as a judge, exhibits high agreement with human evaluations, provides insights into LLM judge biases, and advocates for a hybrid evaluation framework to holistically measure LLM capabilities and human alignment.
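As a rough illustration of the judging setup, the sketch below runs a pairwise comparison in both answer orders and falls back to a tie on disagreement, one of the paper's mitigations for position bias; the `ask_llm` callable and the prompt wording are hypothetical stand-ins, not MT-bench's actual template.

```python
def pairwise_judge(question, answer_a, answer_b, ask_llm):
    """Position-swapped pairwise judging (schematic only).

    `ask_llm(prompt) -> "A" | "B" | "tie"` is a caller-supplied function;
    the prompt paraphrases, rather than reproduces, the paper's template.
    """
    template = (
        "You are an impartial judge. Question: {q}\n"
        "Assistant A: {a}\nAssistant B: {b}\n"
        "Which answer is better? Reply with A, B, or tie."
    )
    first = ask_llm(template.format(q=question, a=answer_a, b=answer_b))
    second = ask_llm(template.format(q=question, a=answer_b, b=answer_a))
    # Map the swapped-order verdict back to the original labels.
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    # Require agreement across both orderings; otherwise call it a tie.
    return first if first == swapped else "tie"
```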
A new method called Elastic-Cache manages Key-Value (KV) caches in Diffusion Large Language Models (DLMs) by adaptively updating them based on attention patterns and layer-specific dynamics. This approach yields up to a 45.1x speedup over baselines on tasks like GSM8K, maintaining or improving generation accuracy across various text-only and multimodal benchmarks.
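A minimal sketch of the attention-aware refresh idea: recompute a layer's cached keys and values only when its attention distribution over cached positions has drifted beyond a threshold. The L1 drift metric and the 0.1 threshold are illustrative assumptions, not Elastic-Cache's exact criterion or layer schedule.

```python
import torch

def should_refresh(prev_attn, curr_attn, threshold=0.1):
    """Decide whether a layer's cached KV entries are stale.

    Compares the previous and current attention distributions over cached
    positions with a simple mean-absolute-difference score (illustrative;
    the paper's criterion is attention-aware but not necessarily this metric).
    """
    drift = (curr_attn - prev_attn).abs().mean().item()
    return drift > threshold

# Toy example: attention over 8 cached positions at two decoding steps.
prev = torch.softmax(torch.randn(8), dim=-1)
curr = torch.softmax(torch.randn(8), dim=-1)

if should_refresh(prev, curr):
    print("recompute KV for this layer")   # pay the cost only when needed
else:
    print("reuse cached KV")               # skip recomputation entirely
```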
TREERPO enhances Large Language Model reasoning by employing a novel tree sampling mechanism to generate fine-grained, step-level reward signals without requiring a separate process reward model. This method improves Pass@1 accuracy by up to 16.5% for Qwen2.5-Math-1.5B and reduces average response length by 18.1% compared to GRPO.
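The core of the tree-sampling idea can be sketched as scoring each intermediate step by the mean verifiable outcome of the leaves beneath it, which yields step-level signals without a process reward model; the Node structure below and the omission of TREERPO's advantage normalization are simplifications.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One reasoning step in a sampled tree; leaves carry a verifiable outcome."""
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # 1.0 correct / 0.0 incorrect at leaves

def step_value(node: Node) -> float:
    """Step-level signal = mean outcome of all leaves below this node."""
    if not node.children:
        return node.outcome if node.outcome is not None else 0.0
    leaves, stack = [], list(node.children)
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            leaves.append(n.outcome if n.outcome is not None else 0.0)
    return sum(leaves) / len(leaves)

# Toy tree: a step whose two sampled continuations succeed half the time.
root = Node(children=[Node(outcome=1.0), Node(outcome=0.0)])
print(step_value(root))  # 0.5 -> fine-grained credit for this step
```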
DR.LLM introduces a retrofittable framework for Large Language Models that dynamically adjusts computational depth, achieving a mean accuracy gain of +2.25 percentage points while executing 5.0 fewer layers on average on in-domain tasks, and generalizing robustly to out-of-domain benchmarks with minimal accuracy drop.
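A minimal sketch of the dynamic-depth idea, assuming a small per-layer router that scores whether a block should execute and treats a skip as the identity function; the router architecture, pooling, and 0.5 gate are illustrative, not DR.LLM's actual design or training objective.

```python
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Tiny router that scores whether a transformer layer should run.

    Illustrative only: DR.LLM retrofits lightweight routers onto an existing
    LLM; the exact router design and training signal are not reproduced.
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):
        # Pool over the sequence and emit an execute-probability per example.
        return torch.sigmoid(self.score(h.mean(dim=1)))

dim = 32
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
router = LayerRouter(dim)
h = torch.randn(2, 10, dim)

p_exec = router(h)                          # [batch, 1] execution scores
mask = (p_exec > 0.5).float()               # hard skip decision at inference
# Skipped layers act as the identity; executed layers run normally.
h = mask.unsqueeze(1) * layer(h) + (1 - mask.unsqueeze(1)) * h
print(p_exec.squeeze(-1))
```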
Researchers from UC San Diego, MBZUAI, and UC Berkeley developed VSA, a trainable sparse attention mechanism for video Diffusion Transformers. It reduces computational costs and inference latency for video generation by up to 2.53x during training and over 2x during inference while maintaining or improving quality.
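The coarse-to-fine flavor of trainable sparse attention can be sketched as follows: mean-pool tokens into blocks, score block-to-block relevance, then run full attention only within each query block's top-k key blocks. The block size, top-k, and dense Python loops are illustrative; VSA's actual kernel and block-selection mechanism are not reproduced.

```python
import torch

def block_sparse_attention(q, k, v, block=16, topk=2):
    """Coarse-to-fine sparse attention sketch (not VSA's exact kernel)."""
    B, L, D = q.shape
    nb = L // block
    qb = q.view(B, nb, block, D).mean(dim=2)        # coarse query blocks
    kb = k.view(B, nb, block, D).mean(dim=2)        # coarse key blocks
    scores = qb @ kb.transpose(1, 2) / D ** 0.5     # [B, nb, nb] block scores
    keep = scores.topk(topk, dim=-1).indices        # critical key blocks

    out = torch.zeros_like(q)
    for b in range(B):
        for i in range(nb):
            cols = keep[b, i]                        # selected key blocks
            ks = k[b].view(nb, block, D)[cols].reshape(-1, D)
            vs = v[b].view(nb, block, D)[cols].reshape(-1, D)
            qs = q[b, i * block:(i + 1) * block]
            attn = torch.softmax(qs @ ks.T / D ** 0.5, dim=-1)
            out[b, i * block:(i + 1) * block] = attn @ vs
    return out

q = k = v = torch.randn(1, 64, 32)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([1, 64, 32])
```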
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results at the cost of measurable scientific progress: without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
ActiveVLN introduces a two-stage framework that enhances Vision-and-Language Navigation agents through active exploration via multi-turn Reinforcement Learning. The method achieves an 11.6% success rate increase on R2R Val-Unseen and demonstrates competitive performance on RxR Val-Unseen with a smaller model and less data, effectively mitigating covariate shift and reducing expert data dependency.
The paper introduces GURU, a multi-domain reinforcement learning corpus with 92K verifiable examples across six distinct reasoning domains, enabling the training of GURU-7B and GURU-32B models. These models achieve state-of-the-art general reasoning capabilities among open RL-trained LLMs, demonstrating how multi-domain RL can lead to robust, generalized reasoning abilities.
The paper introduces Token Order Prediction (TOP), an auxiliary objective for large language model pretraining that trains models to predict the relative proximity of future tokens rather than their exact identity. This method consistently enhances performance across standard NLP benchmarks compared to next-token prediction baselines and multi-token prediction approaches, while being more parameter-efficient.
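One plausible reading of the TOP objective is sketched below: instead of labeling each future position with an exact token, build a per-position target that scores vocabulary items by how soon they next appear within a window, then train against it with a ranking or soft cross-entropy loss. The window size and decay scheme are assumptions, not the paper's exact construction.

```python
import torch

def top_targets(tokens, vocab_size, window=4):
    """Build proximity-based targets for one sequence (illustrative only).

    For each position t, every vocabulary item appearing within the next
    `window` tokens gets a score that decays with its distance (closer =
    higher); everything else stays 0.
    """
    T = len(tokens)
    targets = torch.zeros(T, vocab_size)
    for t in range(T):
        for d in range(1, window + 1):
            if t + d < T:
                tok = tokens[t + d]
                # Keep the nearest-occurrence (highest) score per vocab item.
                targets[t, tok] = max(targets[t, tok].item(),
                                      (window - d + 1) / window)
    return targets

seq = [5, 2, 7, 2, 9]
tgt = top_targets(seq, vocab_size=10)
print(tgt[0])  # position 0: token 2 scores highest, then 7, then 9
```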
Researchers from the University of Rochester, University of Southern California, and MBZUAI developed a framework that quantifies uncertainty in Large Language Model (LLM) evaluations for rating-based tasks. This method leverages conformal prediction with a novel ordinal boundary adjustment, yielding statistically guaranteed prediction intervals and reducing bias in LLM judgments.
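A minimal split-conformal sketch for an ordinal rating task, assuming a held-out calibration set of judge predictions and human ratings; rounding and clipping to the rating scale stands in for the paper's ordinal boundary adjustment, which is not reproduced here.

```python
import numpy as np

def conformal_rating_interval(cal_pred, cal_true, test_pred, alpha=0.1,
                              scale=(1, 5)):
    """Split conformal prediction interval for an ordinal rating (sketch).

    cal_pred / cal_true: LLM-judge predictions and human ratings on a
    calibration set; test_pred: prediction for a new item.
    """
    scores = np.abs(np.asarray(cal_pred) - np.asarray(cal_true))
    n = len(scores)
    # Finite-sample-corrected quantile of the nonconformity scores.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    # Snap to the ordinal scale (a stand-in for the ordinal boundary adjustment).
    lo = max(scale[0], int(np.floor(test_pred - q)))
    hi = min(scale[1], int(np.ceil(test_pred + q)))
    return lo, hi  # marginal coverage >= 1 - alpha under exchangeability

cal_pred = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
cal_true = [3, 5, 2, 4, 3, 4, 2, 2, 5, 3]
print(conformal_rating_interval(cal_pred, cal_true, test_pred=4))  # e.g. (3, 5)
```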
A collaborative effort produced MMTEB, the Massive Multilingual Text Embedding Benchmark, which offers over 500 quality-controlled evaluation tasks across more than 250 languages and 10 categories. The benchmark incorporates significant computational optimizations to enable accessible evaluation and reveals that instruction tuning enhances model performance, with smaller, broadly multilingual models often outperforming larger, English-centric models in low-resource contexts.
Revealing hidden causal variables alongside the underlying causal mechanisms is essential to the development of science. Despite the progress of the past decades, existing practice in causal discovery (CD) relies heavily on high-quality measured variables, which are usually given by human experts. In fact, the lack of well-defined high-level variables behind unstructured data has been a longstanding roadblock to broader real-world application of CD. This procedure can naturally benefit from an automated process that suggests potential hidden variables in the system. Interestingly, large language models (LLMs) are trained on massive observations of the world and have demonstrated great capability in processing unstructured data. To leverage the power of LLMs, we develop a new framework termed Causal representatiOn AssistanT (COAT) that incorporates the rich world knowledge of LLMs to propose useful measured variables for CD with respect to high-value target variables from their paired unstructured data. Instead of directly inferring causality with LLMs, COAT constructs feedback from intermediate CD results to the LLM to refine the proposed variables. Given the target variable and the paired unstructured data, we first develop COAT-MB, which leverages the predictivity of the proposed variables to iteratively uncover the Markov Blanket of the target variable. Built upon COAT-MB, COAT-PAG further uncovers a more complete causal graph, i.e., a Partial Ancestral Graph, by iterating over the target variables and actively seeking new high-level variables. Moreover, the reliable CD capabilities of COAT also extend debiased causal inference to unstructured data by discovering an adjustment set. We establish theoretical guarantees for the CD results and verify their efficiency and reliability across realistic benchmarks and real-world case studies.
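The iterative structure described above can be sketched as a propose-annotate-discover loop; every callable below (llm_propose, annotate, markov_blanket) is a hypothetical stand-in the caller must supply, and the convergence test is a simplification of COAT-MB's actual stopping rule.

```python
def coat_mb_loop(target, documents, llm_propose, annotate, markov_blanket,
                 max_rounds=5):
    """Schematic COAT-MB loop (structure only; all callables are stand-ins).

    llm_propose(target, documents, feedback) -> candidate variable names
    annotate(documents, variables)           -> tabular values per document
    markov_blanket(table, target)            -> (selected vars, diagnostics)
    """
    variables, feedback = [], None
    for _ in range(max_rounds):
        # 1. Ask the LLM for new high-level variables, conditioned on feedback.
        candidates = llm_propose(target, documents, feedback)
        variables = list(dict.fromkeys(variables + candidates))  # dedupe
        # 2. Turn unstructured data into measurements of those variables.
        table = annotate(documents, variables)
        # 3. Run causal discovery and keep only Markov-blanket members.
        selected, diagnostics = markov_blanket(table, target)
        if set(selected) == set(variables):   # nothing pruned -> converged
            return selected
        # 4. Feed intermediate CD results back to sharpen the next proposal.
        variables, feedback = selected, diagnostics
    return variables
```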
BrowseComp-ZH introduces the first comprehensive benchmark for evaluating large language models' web browsing and reasoning capabilities in the Chinese information environment. The benchmark reveals consistently low performance across models and underscores the unique challenges of effectively integrating and reconciling retrieved information from the complex Chinese web.
By introducing "trapping-free" ternary quantization, Tequila effectively reactivates inactive weights in Large Language Models, enabling near full-precision accuracy on zero-shot benchmarks while achieving a 3.0x inference speedup and reducing training data by 10x for efficient edge deployment.
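For context, the sketch below shows standard threshold-based ternary quantization (a TWN-style baseline), where weights inside the deadzone collapse to zero; these are the "inactive" weights the summary says Tequila reactivates. Tequila's trapping-free reactivation itself is not reproduced here.

```python
import torch

def ternarize(w, delta_scale=0.75):
    """Threshold-based ternary quantization (TWN-style baseline, not Tequila).

    Weights with |w| <= delta fall into the deadzone and become zero -- the
    source of the "trapped" weights the summary refers to.
    """
    delta = delta_scale * w.abs().mean()                       # deadzone threshold
    mask = (w.abs() > delta).float()                           # active weights only
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)   # per-tensor scale
    return alpha * torch.sign(w) * mask                        # {-alpha, 0, +alpha}

w = torch.randn(4, 8)
wq = ternarize(w)
print((wq == 0).float().mean())  # fraction of weights trapped in the deadzone
```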
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth, the hardest problem a model can sample, and Breadth, the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify this depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging the rollout size only accelerates convergence and can even hurt Pass@K. DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we aggressively scale the batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, and are key to unleashing its reasoning power.
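The difficulty-adaptive allocation can be sketched as a two-stage loop: estimate per-problem accuracy with a small first-stage rollout budget, then spend an extra budget on the lowest-accuracy problems, where positive rollouts are rarest. The budgets, the half-split of hard problems, and the toy verifier are illustrative assumptions, not DARS's exact multi-stage schedule.

```python
import random

def adaptive_rollouts(problems, rollout, base_n=4, extra_budget=16):
    """Two-stage, difficulty-adaptive rollout allocation (illustrative only).

    Stage 1: `base_n` rollouts per problem to estimate accuracy.
    Stage 2: spend `extra_budget` extra rollouts on the hardest half of the
    problems, increasing the chance of collecting positive samples there.
    """
    stats = {}
    for p in problems:
        wins = sum(rollout(p) for _ in range(base_n))
        stats[p] = wins / base_n

    hard = sorted(problems, key=lambda p: stats[p])      # lowest accuracy first
    n_hard = max(1, len(hard) // 2)
    per_hard = extra_budget // n_hard
    for p in hard[:n_hard]:
        wins = sum(rollout(p) for _ in range(per_hard))
        stats[p] = (stats[p] * base_n + wins) / (base_n + per_hard)
    return stats

# Toy verifier: each "problem" is just its success probability.
problems = [0.9, 0.5, 0.2, 0.05]
print(adaptive_rollouts(problems, rollout=lambda p: random.random() < p))
```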
Researchers from MBZUAI, Monash University, and other institutions developed StreamAgent, an anticipatory agent for streaming video understanding that integrates proactive temporal and spatial anticipation with a novel streaming Key-Value cache. This framework achieves state-of-the-art accuracy on streaming benchmarks, demonstrating up to 10.7% improvement in "Forward Active Responding" over prior models while reducing latency by up to 30% compared to existing efficient video processing methods.
Researchers from Monash University and collaborating institutions introduce LongVLM, a VideoLLM that achieves fine-grained understanding of long videos by efficiently integrating local, temporally ordered segment features with global semantic context. The model outperforms previous state-of-the-art methods on the VideoChatGPT benchmark, showing improvements in Detail Orientation (+0.17) and Consistency (+0.65), and achieves higher accuracy on zero-shot video QA datasets like ANET-QA, MSRVTT-QA, and MSVD-QA.
Researchers from MBZUAI, Sun Yat-sen University, and SUSTech developed A0, a hierarchical diffusion model for general robotic manipulation that understands spatial affordances by predicting object-centric contact points and post-contact trajectories. The embodiment-agnostic model achieved an average success rate of 62.50% on Franka and 53.75% on Kinova robots in real-world tasks, outperforming state-of-the-art methods, particularly in trajectory-intensive manipulations, with fewer execution steps.