alphaXiv

UC Santa Barbara

1,508

12 Jun 2023

computer-science artificial-intelligence computation-and-language

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

ETH Zurich

KAIST

University of Washington Rensselaer Polytechnic Institute

Google DeepMind

University of Amsterdam

University of Illinois at Urbana-Champaign

University of Cambridge Heidelberg University

University of Waterloo Facebook

Carnegie Mellon University

University of Southern California

Google

New York University University of Stuttgart

UC Berkeley

National University of Singapore

University College London

University of Oxford LMU Munich

Shanghai Jiao Tong University

University of California, Irvine

Tsinghua University

Stanford University

University of Michigan

University of Copenhagen

The Chinese University of Hong Kong University of Melbourne

Meta University of Edinburgh

OpenAI

The University of Texas at Austin

Cornell University

University of California, San Diego Yonsei University

McGill University

Boston University University of Bamberg

Nanyang Technological University

Microsoft

KU Leuven

Columbia University UC Santa Barbara

Allen Institute for AI German Research Center for Artificial Intelligence (DFKI)

University of Pennsylvania

Johns Hopkins University

Arizona State University

University of Maryland

University of Tokyo University of North Carolina at Chapel Hill Hebrew University of Jerusalem Amazon Tilburg University University of Massachusetts Amherst University of Rochester University of Duisburg-Essen Sapienza University of Rome University of Sheffield

Princeton University

HKUST University of Tübingen TU Berlin Saarland University Technical University of Darmstadt University of Haifa University of Trento University of Montreal Bilkent University University of Cape Town Bar Ilan University IBM University of Mannheim

ServiceNow Potsdam University Polish-Japanese Academy of Information Technology Salesforce ASAPP AI21 Labs Valencia Polytechnic University University of Trento, Italy

Allen Nie

Jos Rozen

+13

A large-scale and diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models currently achieve aggregate scores below 20 (on a 0-100 normalized scale), indicating significantly lower performance compared to human experts.

1,063

07 Jul 2025

agents computer-science computation-and-language

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

University of Waterloo

Tsinghua University UC Santa Barbara Salesforce Research

VLM2Vec-V2, developed by researchers from Salesforce Research and collaborating universities, introduces a unified multimodal embedding model capable of processing and aligning videos, images, and visual documents with text. The model achieves the highest overall score of 58.0 on the newly introduced MMEB-V2 benchmark, which expands evaluation to 78 tasks across these diverse visual modalities.

432

286

01 Sep 2025

computer-science cryptography-and-security

From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

Boston University UC Santa Barbara

Arizona State University UNSW Sydney

Researchers introduce CVE-GENIE, an automated multi-agent framework leveraging Large Language Models (LLMs) to reproduce Common Vulnerabilities and Exposures (CVEs) and generate verifiable exploits. The framework successfully reproduced 428 out of 841 CVEs (51%) across diverse projects, languages, and vulnerability types, creating a substantial dataset for cybersecurity research.

925

20 Jun 2025

computer-science artificial-intelligence computation-and-language

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

UC Santa Barbara

University of California, Santa Cruz

Chengzhi Liu

Researchers from UC Santa Cruz, Stanford, and UC Santa Barbara systematically investigate amplified visual hallucination in multimodal reasoning models, demonstrating that longer reasoning chains increase ungrounded content. They introduce RH-AUC and RH-Bench to quantify the trade-off between reasoning performance and perceptual fidelity across varying reasoning depths.

229

21 Nov 2025

agentic-frameworks agents computer-science

Budget-Aware Tool-Use Enables Effective Agent Scaling

Google DeepMind

New York University UC Santa Barbara Google Cloud AI Research

This research introduces budget-aware strategies for tool-augmented large language model agents to improve efficiency and performance under resource constraints. The proposed methods, including a Budget Tracker and the BATS framework, enable agents to strategically utilize external tools and achieve higher accuracy with fewer resources compared to traditional approaches.

938

27 Apr 2025

computer-science artificial-intelligence computation-and-language

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Google DeepMind UC Santa Barbara CMU Google Cloud AI Research

Zifeng Wang

Chen-Yu Lee

Speculative Knowledge Distillation (SKD) introduces an interleaved sampling method for LLM compression, dynamically blending teacher-guided corrections with student-generated tokens. This approach consistently outperforms existing knowledge distillation techniques, achieving substantial gains across diverse tasks and data regimes while providing more stable training.

36,524

1,219

09 Oct 2024

computer-science conversational-ai artificial-intelligence

From Persona to Personalization: A Survey on Role-Playing Language Agents

Wuhan University

Fudan University UC Santa Barbara Shanghai University System Inc.

Jiangjie Chen

Siyu Yuan

This paper conducts a comprehensive survey of Role-Playing Language Agents (RPLAs) developed using Large Language Models, proposing a three-tiered taxonomy for personas, detailing their construction and evaluation methodologies, and identifying associated risks and market applications. It systematically organizes current research, providing a foundational understanding of the field's evolution and bridging theoretical insights with practical demands.

3,131

02 Aug 2024

computer-science computation-and-language machine-learning

A Survey on Data Selection for Language Models

University of Toronto

Stanford University UC Santa Barbara

Allen Institute for AI Vector Institute Contextual AI

Alon Albalak

Liangming Pan

A comprehensive survey systematically reviews and categorizes data selection methods for large language models, presenting a unified conceptual framework and taxonomy for understanding diverse approaches across various training stages. The work provides a structured overview of current practices, identifies challenges, and proposes future research directions, aiming to democratize knowledge in this critical area of LLM development.

211

356

08 Oct 2025

computer-science artificial-intelligence computation-and-language

Adaptive Layer-skipping in Pre-trained LLMs

UC Santa Barbara

Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, limited attention has been paid to a fundamental question: How do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive computation in LLMs without modifying their original parameters. Applied to Llama-3-8B, it skips 8 out of 32 layers while maintaining full benchmark performance. Our experiments reveal that computational demands in LLMs significantly vary based on token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens involving computation or high uncertainty requires more layers. Despite the computational savings, FlexiDepth does not yet achieve wall-clock speedup due to varied skipping patterns and I/O overhead. To inspire future work and advance research on practical speedup, we open-sourced FlexiDepth and a dataset documenting its layer allocation patterns.
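The idea of token-wise layer skipping described above can be illustrated with a toy sketch. This is not FlexiDepth's implementation: the `router` callable, the `threshold` parameter, and the scalar "hidden state" are assumptions made for illustration, and the adapter mentioned in the abstract is omitted (a skipped layer is treated as an identity pass-through).

```python
def forward(hidden, layers, router, threshold=0.5):
    """Run `hidden` through `layers`, skipping layers the router scores below `threshold`.

    `router(hidden, idx)` is a hypothetical gate returning a score in [0, 1].
    Returns the final hidden state and the number of layers actually executed.
    """
    used = 0
    for idx, layer in enumerate(layers):
        if router(hidden, idx) >= threshold:
            hidden = layer(hidden)
            used += 1
        # Otherwise the hidden state passes through unchanged (identity skip).
    return hidden, used


# Toy usage: 32 "layers" that each add 1, with a router that skips every 4th layer.
layers = [lambda h: h + 1] * 32
router = lambda h, idx: 0.0 if idx % 4 == 3 else 1.0
hidden, used = forward(0, layers, router)
print(used)  # 24 layers executed, 8 skipped
```

In a real model the router would depend on the token's hidden state, so different tokens would use different numbers of layers; here the gate depends only on the layer index to keep the example deterministic.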

617

30 Aug 2025

agentic-frameworks agents ai-for-cybersecurity

Progent: Programmable Privilege Control for LLM Agents

UC Berkeley UC Santa Barbara

Progent, developed by researchers including those from UC Berkeley, introduces a programmable privilege control framework for Large Language Model (LLM) agents, deterministically blocking malicious tool calls. The system achieves a 0% attack success rate across various benchmarks, including prompt injection and malicious tools, while preserving agent utility and incurring negligible runtime overhead.
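A deterministic allow/deny check on tool calls, in the spirit of the privilege control described above, might look like the following sketch. It does not reproduce Progent's policy language or API: the `Policy` type, the `is_allowed` function, and the example tools are all hypothetical, and the sketch assumes a default-deny rule for tools the policy does not mention.

```python
from typing import Any, Callable, Dict

# Hypothetical policy shape: tool name -> predicate over the call's arguments.
Policy = Dict[str, Callable[[Dict[str, Any]], bool]]


def is_allowed(policy: Policy, tool: str, args: Dict[str, Any]) -> bool:
    """Allow a tool call only if the policy lists the tool and its predicate passes."""
    check = policy.get(tool)
    return check is not None and check(args)


policy: Policy = {
    # Email may only go to an approved recipient.
    "send_email": lambda a: a.get("to") in {"team@example.com"},
    # Transfers are capped at a fixed amount.
    "transfer": lambda a: isinstance(a.get("amount"), (int, float)) and 0 < a["amount"] <= 100,
}

print(is_allowed(policy, "send_email", {"to": "team@example.com"}))  # True
print(is_allowed(policy, "transfer", {"amount": 5000}))              # False
print(is_allowed(policy, "delete_file", {"path": "/tmp/x"}))         # False (not listed)
```

Because the decision is a pure function of the tool name and arguments, the same call is always accepted or rejected regardless of what the model was prompted with.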

245

02 Jul 2024

computer-science computation-and-language model-interpretation

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

UC Santa Barbara MIT-IBM Watson AI Lab

Adobe

Researchers introduce VSP, a benchmark for evaluating Vision Language Models on visual spatial planning tasks. Experiments show that current state-of-the-art VLMs exhibit sub-optimal performance, with visual perception identified as a major bottleneck limiting their ability to comprehend spatial arrangements and devise multi-step action plans.

1,572

02 Apr 2025

computer-science computation-and-language data-curation

Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

ByteDance

NVIDIA UC Santa Barbara

Open-Qwen2VL presents a 2B-parameter multimodal LLM pre-trained using 220 A100-40G GPU hours, which outperforms Qwen2-VL-2B on several benchmarks. The project provides a fully open-source training pipeline, data filtering techniques, and pre-training data to promote reproducibility and accessibility.

230

13 Oct 2023

adversarial-robustness computer-science computation-and-language

Provable Robust Watermarking for AI-Generated Text

UC Santa Barbara

We study the problem of watermarking text generated by large language models (LLMs) -- one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality and correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark.
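A watermark with a fixed grouping of the vocabulary can be sketched as follows. This is a minimal illustration and not the code from the linked repository: the function names, the `key`, `gamma`, and `delta` parameters, and the use of a z-score for detection are assumptions chosen to show the general shape of a fixed "green list" scheme.

```python
import math
import random


def green_list(vocab_size, gamma=0.5, key="demo-key"):
    """Pick a fixed, keyed subset of token ids; the same set is used at every position."""
    rng = random.Random(key)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits, green, delta=2.0):
    """Generation side: add `delta` to the logit of every green token."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def z_score(tokens, green, gamma=0.5):
    """Detection side: standardized green-token count, assuming unmarked text
    hits the green list with probability `gamma` per token."""
    n = len(tokens)
    hits = sum(tok in green for tok in tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


green = green_list(vocab_size=1000)
all_green = sorted(green)[:100]   # an extreme case: every token is green
print(z_score(all_green, green))  # 10.0, i.e. sqrt(100)
```

Because the green list does not depend on the preceding tokens, swapping or deleting a few tokens changes the green count only by the number of tokens touched, which is the intuition behind robustness to editing.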

170

25 Aug 2025

computer-science machine-learning geometric-deep-learning

TopoBench: A Framework for Benchmarking Topological Deep Learning

UC Santa Barbara The University of Manchester Instituto Superior Técnico Sapienza University of Rome University of South Florida University of San Francisco RWTH Aachen University

Paul Snopov

TopoBench introduces an open-source, modular framework designed to standardize benchmarking for Topological Deep Learning (TDL) and accelerate research in the field. Empirical evaluations using the framework demonstrate that higher-order neural networks frequently outperform traditional Graph Neural Networks on tasks benefiting from complex multi-way interactions across diverse datasets.

561

27 Oct 2023

computer-science artificial-intelligence machine-learning

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

California Institute of Technology UT Austin

NVIDIA UC Santa Barbara

MIT

Alex Gu

LeanDojo introduces an open-source toolkit, a benchmark, and ReProver, a retrieval-augmented language model, to advance automated theorem proving in Lean. ReProver achieved 51.2% Pass@1 on the standard LeanDojo Benchmark and discovered 65 new formal proofs across MiniF2F and ProofNet, demonstrating improved generalization over non-retrieval methods and general-purpose LLMs.

380

05 Feb 2024

computer-science computer-vision-and-pattern-recognition human-ai-interaction

Guiding Instruction-based Image Editing via Multimodal Large Language Models

UC Santa Barbara

Apple

MLLM-Guided Image Editing (MGIE) introduces a framework that uses a Multimodal Large Language Model to interpret ambiguous human instructions and generate explicit, visually-aware guidance for a diffusion model. This approach enables more accurate and versatile image editing across various tasks, outperforming existing instruction-based methods in both quantitative and human evaluations.

2,107

02 Apr 2025

chain-of-thought computer-science computation-and-language

ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

UC Santa Barbara

MIT MIT-IBM Watson AI Lab

Researchers from UCSB and MIT develop THINKPRUNE, a reinforcement learning framework that shortens the chain-of-thought reasoning of large language models while maintaining performance. Through iterative length pruning and reward optimization, it achieves a 65% reduction in generation length (from 10,355 to 3,574 tokens) on the DeepSeek-R1-Distill-Qwen-1.5B model.
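The reported reduction can be checked directly from the two token counts in the summary. The helper name below is hypothetical; only the two numbers come from the text.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


reduction = relative_reduction(10_355, 3_574)
print(f"{reduction:.1%}")  # 65.5%
```

The result is consistent with the rounded 65% figure quoted for DeepSeek-R1-Distill-Qwen-1.5B.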

882

15 Jun 2024

computer-science artificial-intelligence computation-and-language

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Stanford University UC Santa Barbara NEC Labs America

Zach Izzo

Researchers at Stanford University developed a statistical method, "distributional GPT quantification," to estimate the proportion of AI-modified content in large text corpora without classifying individual documents. Applying this method to peer reviews, they found that 7-17% of sentences in AI conference reviews were substantially AI-modified after ChatGPT's release, a pattern not observed in Nature Portfolio journals.

342

10 Jun 2024

computer-science computation-and-language ensemble-methods

Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling

UC Santa Barbara

MIT MIT-IBM Watson AI Lab

Researchers developed Input Clarification Ensembling (ICE), a framework that decomposes the total uncertainty of black-box Large Language Models into aleatoric (input ambiguity) and epistemic (model knowledge) components. This decomposition allows for identifying ambiguous inputs, significantly improving the recall of correct answers when users are prompted for clarification.
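One common way to split the uncertainty of an ensemble is an entropy decomposition, sketched below for answer distributions obtained under different clarifications of the same input. This is an illustrative assumption about how such a decomposition can be computed, not the paper's code: the function names are hypothetical, and attributing the between-clarification term to input ambiguity and the within-clarification term to the model follows the description above.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(answer_dists):
    """answer_dists: one answer distribution per clarified version of the input.

    Returns (total, within, between), where
      total   = entropy of the averaged distribution,
      within  = average entropy of each individual distribution,
      between = total - within (disagreement across clarifications).
    """
    k = len(answer_dists)
    n = len(answer_dists[0])
    mean = [sum(d[i] for d in answer_dists) / k for i in range(n)]
    total = entropy(mean)
    within = sum(entropy(d) for d in answer_dists) / k
    return total, within, total - within


# Each clarification yields a confident but different answer: the input is ambiguous.
print(decompose([[1.0, 0.0], [0.0, 1.0]]))
# Every clarification yields the same uncertain answer: the model itself is unsure.
print(decompose([[0.5, 0.5], [0.5, 0.5]]))
```

In the first case all of the uncertainty sits in the between-clarification term, which is the situation where asking the user for clarification helps; in the second case clarification changes nothing.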

583

19 Mar 2022

computer-science computer-vision-and-pattern-recognition information-extraction

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

UC Santa Barbara

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.
