Eugene Wu of Columbia University establishes a theoretical framework for "faithful database visualization," proposing that visualizations should directly map not only data content but also the underlying database constraints, moving beyond the prevalent single-table data model. The work demonstrates how common visualization designs can be understood as emergent properties of multi-table data structures and of preserving their constraints in the visual encoding.
A new framework, Diffusion Steering via Reinforcement Learning (DSRL), enables rapid, autonomous adaptation of pre-trained diffusion policies for robotic control by learning to manipulate their latent noise input space. This approach achieves high sample efficiency and black-box compatibility, making it practical for real-world fine-tuning of large generalist robot policies such as π0.
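A minimal sketch of the latent-noise steering idea, assuming a frozen, black-box diffusion policy: the RL agent's "action" is the initial noise fed to the denoiser, so only forward passes through the pre-trained policy are required. The names `NoisePolicy`, `frozen_diffusion_policy`, and `env` are hypothetical stand-ins; the actual DSRL training uses an off-the-shelf RL algorithm whose details are omitted here.

```python
import torch
import torch.nn as nn

class NoisePolicy(nn.Module):
    """Small MLP that outputs the latent noise fed to the frozen diffusion policy."""
    def __init__(self, obs_dim: int, noise_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, noise_dim), nn.Tanh(),  # bounded noise "action"
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def rollout_step(noise_policy, frozen_diffusion_policy, env, obs):
    # The RL action is the initial noise; the frozen policy is a black box
    # that denoises it into a robot action conditioned on the observation.
    z = noise_policy(obs)                     # steered latent noise
    action = frozen_diffusion_policy(obs, z)  # forward pass only, no gradients through it
    next_obs, reward, done, info = env.step(action)
    return next_obs, reward, done
```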
A comprehensive survey details the field of Retrieval-Augmented Generation with Graphs (GraphRAG), proposing a unified framework for integrating graph-structured data into RAG systems and tracing its specialized application across ten distinct domains, providing a structured understanding of current techniques and future research directions.
Amazon Web Services researchers developed Chronos-2, a pretrained time series model designed for zero-shot forecasting across univariate, multivariate, and covariate-informed tasks using a unified framework. The model achieved an average win rate of 90.7% and a 47.3% skill score on the fev-bench, demonstrating superior performance, especially when leveraging in-context learning for covariate data.
A large-scale and diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models achieve aggregate scores below 20 (on a 0-100 normalized scale), well below the performance of human experts.
Chain of Thought (CoT) monitorability offers a distinct capability for AI safety by providing insight into an AI's internal reasoning processes, including potential intent to misbehave. This paper argues that while currently useful for detecting misbehavior and misalignment, this property is fragile and requires proactive research and development to preserve it as AI systems scale.
The open-source Goedel-Prover-V2 series of language models improves formal theorem proving in Lean through scaffolded data synthesis and verifier-guided self-correction. Its 32B-parameter model achieves 90.4% pass@32 on MiniF2F, outperforming a 671B-parameter system by a large margin while being 20 times smaller, and establishes a new open-source record on PutnamBench with 86 problems solved at pass@184.
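A hedged sketch of the verifier-guided self-correction loop in the spirit of Goedel-Prover-V2: generate a proof, check it with a Lean verifier, and feed compiler errors back for revision. `generate_proof` and `lean_verify` are hypothetical stand-ins for the model and a Lean 4 checker; the paper's pipeline also uses such feedback during training-data synthesis, not only at inference.

```python
def prove_with_self_correction(theorem: str, max_rounds: int = 3):
    attempt = generate_proof(theorem)                  # initial proof attempt
    for _ in range(max_rounds):
        ok, error_msg = lean_verify(theorem, attempt)  # compile/check in Lean
        if ok:
            return attempt
        # Feed the verifier's error back so the model can revise the proof.
        attempt = generate_proof(theorem, feedback=error_msg)
    return None  # unproved within budget
```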
Researchers from Amazon, the University of Virginia, and Georgia Institute of Technology developed WEBAGENT-R1, an end-to-end multi-turn reinforcement learning framework for training large language model (LLM) based web agents. The framework significantly boosted Llama-3.1-8B's task success rate from 20.6% (Behavior Cloning) to 44.8% on the WebArena-Lite benchmark, surpassing strong proprietary models like OpenAI o3 (39.4%).
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
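A minimal sketch of the Gain Beyond RAG reward described above: the generator stays frozen, and the searcher is rewarded only for the accuracy it adds over naive top-k retrieval. `frozen_generate` and `answer_score` are hypothetical helpers (e.g., an LLM call and an exact-match or judge score).

```python
def gain_beyond_rag(question, gold, searcher_docs, naive_rag_docs):
    searcher_answer = frozen_generate(question, searcher_docs)   # docs chosen by the trained searcher
    baseline_answer = frozen_generate(question, naive_rag_docs)  # docs from naive RAG retrieval
    # Reward = improvement in generation accuracy over the naive-RAG baseline.
    return answer_score(searcher_answer, gold) - answer_score(baseline_answer, gold)
```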
This survey provides a comprehensive review of instruction tuning for Large Language Models, detailing methodologies, datasets, models, and applications. It highlights how instruction tuning aligns LLMs with human instructions and demonstrates its continued necessity as a foundational step in modern alignment pipelines, while also addressing challenges like superficial alignment.
Researchers from Penn State, in collaboration with industry partners, provide the first comprehensive survey of Reinforcement Learning-based agentic search, systematically organizing its foundational concepts, functional roles, optimization strategies, and applications. This work clarifies the interplay between RL and agentic LLMs, delineating current capabilities, evaluation methods, and critical future research directions.
M+ extends the MemoryLLM architecture, enabling Large Language Models to retain and recall information over sequence lengths exceeding 160,000 tokens, a substantial improvement over MemoryLLM's previous 20,000 token limit. This is achieved through a scalable long-term memory mechanism and a co-trained retriever that efficiently retrieves relevant hidden states, while maintaining competitive GPU memory usage.
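A hedged sketch of retrieval over cached hidden states, as in M+'s co-trained retriever; the similarity scoring and absence of learned projections here are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def retrieve_memories(query_hidden, memory_bank, k=32):
    """query_hidden: (d,) current hidden state; memory_bank: (N, d) stored hidden states."""
    q = F.normalize(query_hidden, dim=-1)
    m = F.normalize(memory_bank, dim=-1)
    scores = m @ q                        # cosine similarity against all stored memories
    top = torch.topk(scores, k).indices  # indices of the k most relevant states
    return memory_bank[top]              # hidden states re-injected into the model's context
```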
Research from Amazon and academic partners demonstrates that domain-specific Supervised Fine-Tuning (SFT) in Large Language Models does not inherently degrade general capabilities, but rather its impact is strongly influenced by the learning rate. The study introduces Token-Adaptive Loss Reweighting (TALR) as a method to maintain a superior balance between domain-specific performance and general capability preservation.
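A minimal sketch of token-adaptive loss reweighting in the spirit of TALR. The specific weighting function below (upweighting tokens the model already assigns high probability, to soften the aggressive updates that drive forgetting) is an assumption for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def talr_loss(logits, targets, temperature=1.0):
    # Per-token cross entropy: logits (batch, seq, vocab), targets (batch, seq).
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    with torch.no_grad():
        p = torch.exp(-ce)                          # model's probability of each target token
        w = torch.softmax(p / temperature, dim=-1)  # assumed scheme: more weight on confident tokens
    return (w * ce).sum(dim=-1).mean()              # weighted loss per sequence, averaged over batch
```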
A soft context compression method for Large Language Models, CompLLM, developed by Amazon and UCF, uses a segment-based approach to significantly improve efficiency and scalability for long-context Question & Answer tasks. It achieves up to a 4x speedup in Time To First Token and a 50% reduction in KV cache size, while also enhancing accuracy on contexts exceeding 50,000 tokens.
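A hedged sketch of the segment-wise soft compression idea behind CompLLM: each segment is compressed independently, so compressed representations can be cached and reused across queries, and compression cost scales linearly with context length. `compressor` and its `out_len` parameter are hypothetical; the real module and training objective differ in detail.

```python
import torch

def compress_context(token_embeds, compressor, seg_len=512, ratio=2):
    """token_embeds: (T, d) embeddings of a long context. Returns ~T/ratio soft embeddings."""
    segments = token_embeds.split(seg_len, dim=0)
    # Independent per-segment compression enables caching and linear scaling.
    compressed = [compressor(seg, out_len=max(1, len(seg) // ratio)) for seg in segments]
    return torch.cat(compressed, dim=0)  # prepended to the query tokens at inference
```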
The paper introduces the Agentic Benchmark Checklist (ABC), a systematic framework for designing and assessing rigorous evaluations for AI agents, addressing pervasive flaws in existing benchmarks that can lead to up to 100% relative misestimation of agent capabilities. It identifies key threats to evaluation rigor and demonstrates how applying the ABC empirically validates flaws and guides the creation of more accurate benchmarks.
Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks, IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.
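A minimal sketch of the constraint-attention metric described above: the share of attention mass that generated tokens place on the instruction's constraint tokens. How the paper aggregates across layers and heads is an assumption here (a plain mean).

```python
import torch

def constraint_attention(attn, constraint_mask):
    """attn: (layers, heads, gen_len, ctx_len) attention weights;
    constraint_mask: (ctx_len,) bool, True at instruction-constraint tokens."""
    mass_on_constraints = attn[..., constraint_mask].sum(dim=-1)  # per layer/head/generation step
    return mass_on_constraints.mean().item()  # average focus on the constraints
```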
PatchCore introduces a method for cold-start industrial anomaly detection that leverages locally aware patch features and a coreset-reduced memory bank, achieving state-of-the-art image-level AUROC of 99.6% and efficient pixel-wise anomaly localization on the MVTec AD benchmark.
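A hedged sketch of PatchCore-style scoring: patch features from a pretrained backbone populate a memory bank, greedy coreset subsampling shrinks it, and the nearest-neighbor distance of test patches to the bank yields the anomaly score. Feature extraction and the paper's score re-weighting are omitted.

```python
import torch

def coreset_subsample(features, n_keep):
    """Greedy farthest-point selection over (N, d) patch features."""
    idx = [0]
    dists = torch.cdist(features, features[idx]).squeeze(1)
    for _ in range(n_keep - 1):
        idx.append(int(dists.argmax()))                      # farthest point from current coreset
        new = torch.cdist(features, features[idx[-1]:idx[-1] + 1]).squeeze(1)
        dists = torch.minimum(dists, new)                    # distance to nearest coreset member
    return features[idx]

def anomaly_score(test_patches, memory_bank):
    # Image-level score: the largest distance from any test patch to its
    # nearest neighbor in the coreset memory bank.
    nn_dists = torch.cdist(test_patches, memory_bank).min(dim=1).values
    return nn_dists.max().item()
```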
A new paradigm, Agentic Deep Research, is proposed where Large Language Models act as autonomous agents, performing iterative reasoning and strategic search to address complex information needs. This approach empirically outperforms traditional web search and basic RAG systems on challenging benchmarks, demonstrating its ability to significantly reduce user cognitive load for deep information tasks.
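A minimal sketch of the iterative reason-search-synthesize loop that characterizes Agentic Deep Research. `llm` and `web_search` are hypothetical stand-ins; real systems add planning, memory, and citation tracking on top of this skeleton.

```python
def deep_research(question, llm, web_search, max_steps=8):
    notes = []
    for _ in range(max_steps):
        step = llm(f"Question: {question}\nNotes so far: {notes}\n"
                   "Reply with SEARCH: <query> or ANSWER: <final answer>.")
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        query = step.removeprefix("SEARCH:").strip()
        notes.append(web_search(query))  # evidence accumulates across turns
    return llm(f"Give a best-effort answer to {question} using these notes: {notes}")
```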
Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models, where stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives during training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
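A hedged sketch of hypervolume-guided weight adaptation (approach 1 above), restricted to two objectives for clarity. The update rule shown, tilting weights toward the objective whose marginal improvement most grows the front's hypervolume, is a simplification of the paper's method; the probe size and normalization are assumptions.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume dominated by 2-objective points (to maximize) w.r.t. reference point ref."""
    pts = sorted([p for p in points if (p > ref).all()], key=lambda p: -p[0])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:  # sweep in decreasing x, accumulating non-overlapping strips
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def update_weights(weights, front, ref, probe=1e-2):
    # Probe each objective: how much would a small improvement grow the
    # current front's hypervolume? Shift reward weight toward objectives
    # with the largest marginal gain.
    base = hypervolume_2d(front, ref)
    gains = []
    for j in range(len(weights)):
        probed = [p + probe * np.eye(len(weights))[j] for p in front]
        gains.append(hypervolume_2d(probed, ref) - base)
    gains = np.maximum(gains, 1e-8)
    new_w = weights * gains
    return new_w / new_w.sum()
```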
OpenTSLM introduces time-series language models that enable large language models (LLMs) to natively integrate and reason over multivariate medical text- and time-series data. The models significantly outperform existing baselines on tasks like ECG question answering and sleep staging, while generating interpretable Chain-of-Thought rationales validated by medical experts.