alphaXiv


Center for Artificial Intelligence Technology

06 Jun 2024

computer-science artificial-intelligence computation-and-language

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

The University of Melbourne, MBZUAI, HSE University, AIRI, FRC CSC RAS, Center for Artificial Intelligence Technology, QCRI

Gleb Kuzmin

Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate at the current step and what surface form to use. Our method, Claim Conditioned Probability (CCP), measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.
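
To make the idea of scoring atomic claims from token-level information concrete, here is a minimal Python sketch. It is not CCP: it uses a plain probability-product baseline, and it assumes that per-token log-probabilities and a mapping from each claim to its token positions are already available. The names `claim_uncertainty` and `flag_claims`, the example claims, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence


def claim_uncertainty(token_logprobs: Sequence[float],
                      claim_token_idx: Sequence[int]) -> float:
    """Baseline claim-level uncertainty (an assumption, not CCP):
    1 minus the product of the probabilities of the claim's tokens."""
    total_logprob = sum(token_logprobs[i] for i in claim_token_idx)
    return 1.0 - math.exp(total_logprob)


def flag_claims(token_logprobs: Sequence[float],
                claims: Dict[str, List[int]],
                threshold: float = 0.5) -> Dict[str, bool]:
    """Mark a claim as potentially unreliable when its uncertainty
    exceeds an arbitrary, illustrative threshold."""
    return {
        name: claim_uncertainty(token_logprobs, idx) > threshold
        for name, idx in claims.items()
    }


# Hypothetical log-probabilities for four generated tokens.
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.4)]
# Hypothetical alignment of two atomic claims to token positions.
example_claims = {"born in 1950": [0, 1], "born in Paris": [2, 3]}
example_flags = flag_claims(example_logprobs, example_claims)
```

A baseline like this mixes together every source of token uncertainty, including which claim to state and how to phrase it; the abstract describes CCP as designed to remove exactly those two effects.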

30 Jun 2025

computer-science computation-and-language machine-learning

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

University of Amsterdam, The University of Melbourne, MBZUAI, HSE University, AIRI, Center for Artificial Intelligence Technology

This work introduces LM-Polygraph, an extensive benchmark for evaluating uncertainty quantification (UQ) methods in large language models across diverse text generation tasks and languages. The benchmark identifies effective UQ techniques suitable for specific contexts and demonstrates that performance-calibrated confidence methods yield interpretable and well-calibrated scores.

21 Oct 2025

computer-science computation-and-language machine-translation

Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

The University of Melbourne, MBZUAI, HSE University, AIRI, Center for Artificial Intelligence Technology

Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM, which is hard to model explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-stage training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over competing unsupervised and supervised approaches.
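
Selective generation, the setting mentioned at the end of this abstract, can be illustrated with a short Python sketch: the system abstains on the outputs it is least sure about, and quality is measured on what remains. This is a generic illustration under stated assumptions, not the paper's evaluation protocol; the function name `retained_quality`, the rounding rule, and the example numbers are hypothetical.

```python
from typing import Sequence


def retained_quality(uncertainty: Sequence[float],
                     quality: Sequence[float],
                     reject_fraction: float) -> float:
    """Mean quality of the outputs kept after abstaining on the
    most uncertain `reject_fraction` of them (at least one is kept)."""
    if len(uncertainty) != len(quality) or len(uncertainty) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")
    # Sort output indices from most confident to least confident.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    n_reject = int(len(order) * reject_fraction)
    n_keep = max(1, len(order) - n_reject)
    kept = order[:n_keep]
    return sum(quality[i] for i in kept) / n_keep


# Hypothetical uncertainty scores and per-output quality (1 = correct).
example_uncertainty = [0.1, 0.9, 0.3, 0.7]
example_quality = [1.0, 0.0, 1.0, 0.0]
quality_all = retained_quality(example_uncertainty, example_quality, 0.0)
quality_half = retained_quality(example_uncertainty, example_quality, 0.5)
```

If the uncertainty scores are informative, retained quality rises as the rejection fraction grows; in this toy data it goes from 0.5 with no rejection to 1.0 when the more uncertain half is dropped.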
