alphaXiv

History

Papers Benchmarks

Saarland University

736

08 Oct 2025

computer-science artificial-intelligence machine-learning

GRPO is Secretly a Process Reward Model

Saarland University

Michael Sullivan

This research formally demonstrates that the standard Group Relative Policy Optimization (GRPO) algorithm inherently induces a Monte-Carlo-based Process Reward Model for large language models. It introduces \u03bb-GRPO, a minor modification that corrects a scaling flaw in GRPO's objective, leading to a \u223c2x training speedup and improved performance on downstream reasoning tasks compared to standard GRPO.

1,508

12 Jun 2023

computer-science artificial-intelligence computation-and-language

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

ETH Zurich

KAIST

University of Washington Rensselaer Polytechnic Institute

Google DeepMind

University of Amsterdam

University of Illinois at Urbana-Champaign

University of Cambridge Heidelberg University

University of Waterloo Facebook

Carnegie Mellon University

University of Southern California

Google

New York University University of Stuttgart

UC Berkeley

National University of Singapore

University College London

University of Oxford LMU Munich

Shanghai Jiao Tong University

University of California, Irvine

Tsinghua University

Stanford University

University of Michigan

University of Copenhagen

The Chinese University of Hong Kong University of Melbourne

Meta University of Edinburgh

OpenAI

The University of Texas at Austin

Cornell University

University of California, San Diego Yonsei University

McGill University

Boston University University of Bamberg

Nanyang Technological University

Microsoft

KU Leuven

Columbia University UC Santa Barbara

Allen Institute for AI German Research Center for Artificial Intelligence (DFKI)

University of Pennsylvania

Johns Hopkins University

Arizona State University

University of Maryland

University of Tokyo University of North Carolina at Chapel Hill Hebrew University of Jerusalem Amazon Tilburg University University of Massachusetts Amherst University of Rochester University of Duisburg-Essen Sapienza University of Rome University of Sheffield

Princeton University

HKUST University of Tübingen TU Berlin Saarland University Technical University of Darmstadt University of Haifa University of Trento University of Montreal Bilkent University University of Cape Town Bar Ilan University IBM University of Mannheim

ServiceNow Potsdam University Polish-Japanese Academy of Information Technology Salesforce ASAPP AI21 Labs Valencia Polytechnic University University of Trento, Italy

Allen Nie

Jos Rozen

+13

A large-scale and diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models currently achieve aggregate scores below 20 (on a 0-100 normalized scale), indicating significantly lower performance compared to human experts.

1,515

26 Jul 2024

agent-based-systems autonomous-vehicles computer-science

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Allen Institute for AI

Stony Brook University Saarland University

AppWorld is a comprehensive framework introducing a high-fidelity simulation of nine everyday applications and a benchmark of 750 complex tasks to evaluate large language model agents. It reveals that state-of-the-art models like GPT-4 complete less than half of normal tasks and around 30% of challenge tasks, indicating significant room for improvement in real-world digital environments.

143

1,377

05 May 2023

adversarial-attacks computer-science computer-vision-security

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Saarland University CISPA – Helmholtz Center for Information Security sequire technology GmbH

Thorsten Holz

This paper identifies and categorizes a novel threat called Indirect Prompt Injection (IPI), demonstrating how malicious prompts embedded in external data can compromise LLM-integrated applications like Bing Chat and GitHub Copilot. The research illustrates that LLMs can be manipulated to exfiltrate data, spread malware, or generate misleading content, often bypassing existing security filters by treating retrieved data as executable instructions.

1,348

08 Apr 2024

computer-science conversational-ai artificial-intelligence

ADaPT: As-Needed Decomposition and Planning with Language Models

Allen Institute for AI Saarland University UNC-Chapel Hill

ADAPT, a framework from UNC Chapel Hill, AI2, and Saarland University, enables Large Language Models to act as robust agents by dynamically decomposing complex tasks into simpler sub-tasks only when needed. This approach significantly increased success rates by up to 33% across diverse interactive environments like ALFWorld, WebShop, and a new TextCraft dataset, outperforming existing plan-and-execute methods and other adaptive baselines.

432

30 Jun 2025

computer-science artificial-intelligence machine-learning

Quantum computing and artificial intelligence: status and perspectives

CNRS Freie Universität Berlin

University of Oxford TU Dortmund University German Research Center for Artificial Intelligence (DFKI)University of Innsbruck Collège de France Max Planck Institute for the Science of Light Friedrich-Alexander-Universität Erlangen-Nürnberg Institut Polytechnique de Paris University of Latvia University of Turku Saarland University Fondazione Bruno Kessler TU Wien

Chalmers University of Technology Forschungszentrum Jülich University of Regensburg University of Florence University of Augsburg University of Gothenburg Leiden Institute of Physics Donostia International Physics Center Johannes Kepler University Linz Fraunhofer Heinrich-Hertz-Institute SAP SE Friedrich-Schiller-University Jena European Centre for Theoretical Studies in Nuclear Physics and Related Areas (ECT*)EPITA Research Lab Leiden Institute of Advanced Computer Science ÖAW Vienna Center for Quantum Science and Technology Atominstitut University of Applied Sciences Zittau/Görlitz IQOQI Vienna Fraunhofer IOSB-AST Universit PSL Inria Paris–Saclay Universit Paris Diderot `Ecole Polytechnique University of Naples “Federico II”INFN Sezione di Firenze

A collaborative white paper coordinated by the Quantum Community Network comprehensively analyzes the current status and future perspectives of Quantum Artificial Intelligence, categorizing its potential into "Quantum for AI" and "AI for Quantum" applications. It proposes a strategic research and development agenda to bolster Europe's competitive position in this rapidly converging technological domain.

263

03 Aug 2025

computer-science artificial-intelligence computation-and-language

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Saarland University

Researchers from Saarland University introduced Neighbor Distance Minimization (NDM), an unsupervised learning method that decomposes neural network representation space into interpretable, non-basis-aligned subspaces. The approach quantitatively demonstrated superior concentration of task-relevant information within identified subspaces and yielded qualitatively distinct feature encodings in GPT-2 Small and larger 2B-parameter models.

644

02 Jun 2025

computer-science computation-and-language human-ai-interaction

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

University of Amsterdam LMU Munich

University of Copenhagen

ETH Zürich Universidade de Lisboa Utrecht University Saarland University University of Potsdam Heriot-Watt University University of Trento MCML Unbabel

A comprehensive empirical study assesses the reliability of Large Language Models (LLMs) as automated evaluators across 20 diverse Natural Language Processing tasks. The research evaluates 11 different LLMs, including both proprietary and open-weight models, against human judgments, revealing that LLM performance varies substantially by task and property evaluated and is generally below human inter-annotator agreement.

226

28 Sep 2023

computer-science artificial-intelligence computation-and-language

LawBench: Benchmarking Legal Knowledge of Large Language Models

Shanghai AI Laboratory

Nanjing University Saarland University Amazon Alexa AI

Songyang Zhang

LawBench introduces the first comprehensive evaluation benchmark for assessing Large Language Models' legal knowledge and capabilities within the Chinese civil law system. The benchmark, featuring 20 diverse tasks categorized into memorization, understanding, and application, reveals GPT-4's leading performance while highlighting significant limitations and areas for improvement across other models in this specialized domain.

377

589

27 May 2025

computer-science computation-and-language

Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

The Hong Kong Polytechnic University Saarland University Eastern Institute of Technology Meituan Inc.

Researchers systematically investigated factors influencing the distillation of Chain-of-Thought (CoT) reasoning into Small Language Models (SLMs), identifying that optimal CoT granularity is non-monotonic and student-dependent, format impact is minimal, and teacher choice effectiveness varies by task. The study revealed a 'Matthew Effect,' where stronger SLMs gained more from CoT distillation, challenging assumptions about knowledge transfer.

02 Oct 2025

computer-science computation-and-language machine-learning

Reason to Rote: Rethinking Memorization in Reasoning

Munich Center for Machine Learning (MCML)LMU Munich Utrecht University Saarland University

A study explores how large language models reconcile memorizing incorrect labels with applying generalizable reasoning. It reveals that models retain correct intermediate computations even for noisy instances, employing "outlier heuristics" in specific neurons to override these results for memorized outputs.

211

30 Jul 2024

computer-science computer-vision-and-pattern-recognition

latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Saarland University Max-Planck Institute for Informatics

latentSplat introduces a novel framework that integrates the efficiency of 3D Gaussian Splatting with the generative capabilities of variational autoencoders and GANs, enabling fast and generalizable 3D reconstruction and high-quality novel view synthesis from just two input images. The method achieves superior perceptual and generative quality while maintaining near real-time rendering speeds and can be trained purely on real video data.

198

589

01 Sep 2025

computer-science computation-and-language knowledge-distillation

Not All Data Are Unlearned Equally

Mila - Quebec AI Institute

McGill University Saarland University

Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this all data is equal assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.

102

24 Sep 2025

agentic-frameworks agents computer-science

Procedural Environment Generation for Tool-Use Agents

Saarland University

Michael Sullivan

RandomWorld introduces a pipeline for procedurally generating interactive tools and compositional tool-use data for LLM agents, enabling large-scale training for online reinforcement learning. This method leads to a new State-of-the-Art on NESTFUL F1-Function (0.96) and F1-Parameter (0.71), significantly enhancing agent performance and generalization.

30 Sep 2025

computer-science machine-learning mathematics

FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

University of Zurich

Kyoto University Saarland University CISPA OIST

Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not converge to the stationary point since the LMO is a biased operator. We then propose FedMuon which can mitigate this issue. We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon can converge for any number of Newton-Schulz iterations, while it can converge faster as we solve the LMO more accurately. Through experiments, we demonstrated that FedMuon can outperform the state-of-the-art federated learning methods.

116

25 Mar 2021

computer-science machine-learning model-interpretation

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Saarland University École Polytechnique Fédérale de Lausanne

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. Code to reproduce our results is available online: this https URL.

134

221

12 Jul 2025

attention-mechanisms chain-of-thought computer-science

Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Saarland University Sharif University of Technology

Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from

TC^0

PTIME

, their required length remains poorly understood. Empirical evidence even suggests that transformers need scratchpads even for many problems in

TC^0

, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of chain-of-thought steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to emerging understanding of the power and limitations of chain-of-thought reasoning.

121

23 May 2025

computer-science computation-and-language reasoning

Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

Utrecht University Saarland University

Language models can perform implicit multi-hop reasoning up to 4 hops, achieving high accuracy when provided with sufficient training data. This capability, however, incurs an exponential increase in data requirements which curriculum learning can substantially reduce.

25 Aug 2025

computer-science computer-vision-and-pattern-recognition generative-models

Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance

Bielefeld University Saarland University Max-Planck Institute for Informatics

We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.

12 Oct 2025

computer-science computation-and-language data-curation

MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

University of Southern California University of Melbourne Howard University

Leiden University Saarland University Portland State University University of São Paulo

Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanation using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

GRPO is Secretly a Process Reward Model

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

ADaPT: As-Needed Decomposition and Planning with Language Models

Quantum computing and artificial intelligence: status and perspectives

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

LawBench: Benchmarking Legal Knowledge of Large Language Models

Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Reason to Rote: Rethinking Memorization in Reasoning

latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Not All Data Are Unlearned Equally

Procedural Environment Generation for Tool-Use Agents

FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance

MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation

Events

AI for Law

Personalize Your Feed