Researchers introduced a predictive framework for Reinforcement Learning (RL) in Large Language Models (LLMs) using a sigmoidal compute-performance curve, enabling performance extrapolation from smaller runs. Their ScaleRL recipe, demonstrated over 100,000 GPU-hours, achieves an asymptotic reward of 0.61 on verifiable math problems, outperforming established methods while exhibiting predictable scaling across model size, generation length, and multi-task settings.
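As an illustration of the extrapolation idea, here is a minimal sketch that fits a saturating sigmoid to (compute, reward) points from small runs and extrapolates to a larger budget; the functional form and data points are illustrative assumptions, not the paper's exact parameterization:

```python
# Sketch: fit a saturating compute-performance sigmoid and extrapolate.
# Functional form and data points are illustrative, not the paper's setup.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_reward(c, r_asym, c_mid, b):
    """Reward vs. compute c; saturates at r_asym as c grows."""
    return r_asym / (1.0 + (c_mid / c) ** b)

compute = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])  # GPU-hours (hypothetical)
reward = np.array([0.12, 0.22, 0.35, 0.46, 0.54])            # mean reward (hypothetical)

params, _ = curve_fit(sigmoid_reward, compute, reward, p0=[0.6, 1000.0, 1.0])
print("estimated asymptotic reward:", round(params[0], 2))
print("predicted reward at 100k GPU-hours:", round(sigmoid_reward(1e5, *params), 2))
```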
Researchers at Meta AI and collaborators developed Token Assorted, a method that combines discrete latent tokens with text tokens for Large Language Model reasoning. This approach enhances reasoning accuracy on synthetic and mathematical benchmarks while reducing reasoning trace length by an average of 17%.
Researchers from Meta AI, McGill University, and UCL present META MOTIVO, a Behavioral Foundation Model enabling zero-shot whole-body control for humanoid agents across diverse tasks. This model uses an online unsupervised reinforcement learning algorithm, FB-CPR, which learns from unlabeled motion capture data to generate natural, human-like movements. Human evaluations showed a preference for FB-CPR-generated behaviors in terms of naturalness, even over those from reward-optimized agents.
The PRISM Alignment Dataset is a new resource for understanding human preferences in LLM alignment, specifically capturing subjective, individualized, and multicultural dimensions. It includes 8,011 conversations and detailed participant profiles from a diverse global sample, demonstrating that preferences over LLM behavior vary significantly by user demographics and conversation context, and that sampling decisions heavily influence alignment outcomes.
A unified 3D reconstruction framework from UCL and Naver Labs Europe enables flexible incorporation of camera and scene priors such as intrinsics, poses, and depth maps into transformer-based architectures, achieving state-of-the-art performance across multiple 3D vision tasks while enabling high-resolution processing and efficient pose estimation through dual coordinate prediction.
This study provides empirical evidence that Large Language Models latently perform multi-hop reasoning by internally resolving descriptive mentions to entities and then utilizing that knowledge. While the ability to recall the bridge entity improves with model scale, the subsequent step of leveraging this resolved entity for consistent compositional reasoning shows moderate success and does not significantly scale, suggesting a bottleneck in current LLM architectures for truly compositional knowledge utilization.
EFFIBENCH-X introduces the first multi-language benchmark designed to measure the execution time and memory efficiency of code generated by large language models. The evaluation of 26 state-of-the-art LLMs reveals a consistent efficiency gap compared to human-expert solutions, with top models achieving around 62% of human execution time efficiency and varying performance across different programming languages and problem types.
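To make the headline number concrete, here is a minimal sketch assuming (hypothetically) that efficiency is scored as the ratio of the human reference's execution time to the model's, capped at 1.0; EFFIBENCH-X's exact metric definition may differ:

```python
# Hypothetical efficiency score: human reference time over model time, capped
# at 1.0. Illustrates the comparison; not EFFIBENCH-X's exact metric.
def time_efficiency(model_seconds: float, human_seconds: float) -> float:
    return min(human_seconds / model_seconds, 1.0)

# A model solution taking ~1.6x the human reference time scores ~0.62:
print(time_efficiency(model_seconds=1.6, human_seconds=1.0))
```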
Researchers at Anthropic and affiliated universities developed Best-of-N (BoN) Jailbreaking, a simple, black-box method that systematically circumvents safety safeguards in frontier AI models across text, vision, and audio modalities. The approach, which involves repeatedly submitting augmented harmful requests, demonstrates high attack success rates and reveals that adversarial success scales predictably with computational resources following a power-law behavior.
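A minimal sketch of the power-law forecasting described above, assuming the fitted quantity is -log(ASR) as a power law in the number of sampled attempts N; the measurements below are hypothetical:

```python
# Sketch: fit -log(ASR) ~ a * N^(-b) on small sample budgets, then forecast
# the attack success rate (ASR) at a larger N. Data points are hypothetical.
import numpy as np

N = np.array([10.0, 30.0, 100.0, 300.0])
asr = np.array([0.05, 0.12, 0.28, 0.45])

# Linearize: log(-log ASR) = log(a) - b * log(N)
slope, intercept = np.polyfit(np.log(N), np.log(-np.log(asr)), 1)
a, b = np.exp(intercept), -slope

forecast = np.exp(-a * 10_000.0 ** (-b))  # forecast ASR at N = 10,000
print(f"a={a:.2f}, b={b:.2f}, forecast ASR={forecast:.2f}")
```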
Large Language Models acquire reasoning capabilities by synthesizing procedural knowledge from pretraining data, particularly from code and mathematical formulae, rather than through direct retrieval of specific answers. This mechanism was identified by analyzing the influence of pretraining documents on model outputs using influence functions on Cohere's Command R models.
This paper investigated the internal reasoning pathways of large language models for multi-hop queries, identifying a 'hopping too late' phenomenon where later layers fail to effectively compose information despite earlier layers resolving intermediate steps. A novel 'back-patching' method was introduced, which corrected 32% to 66% of previously incorrect multi-hop answers by feeding hidden representations from later layers back into earlier ones.
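A minimal sketch of the back-patching idea using a small HuggingFace model and a forward hook; the model, layer indices, and prompt are illustrative choices rather than the paper's exact setup:

```python
# Sketch of back-patching: capture the last-token hidden state after a later
# layer, then re-run the prompt while substituting that vector into an
# earlier layer. Model, layers, and prompt are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("The spouse of the performer of Imagine is", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
later_vec = out.hidden_states[9][0, -1].clone()  # state after a later block

def patch_hook(module, args, output):
    hidden = output[0]
    hidden[0, -1] = later_vec  # overwrite earlier-layer output at last position
    return (hidden,) + output[1:]

handle = model.transformer.h[4].register_forward_hook(patch_hook)  # earlier block
with torch.no_grad():
    patched = model(**inputs)
handle.remove()
print(tok.decode(patched.logits[0, -1].argmax()))
```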
A study by researchers from Google DeepMind and Google Research rigorously evaluated Large Language Models' latent multi-hop reasoning abilities using a new shortcut-free dataset, SOCRATES. It found that while LLMs struggle with general latent multi-hop reasoning (e.g., GPT-4o at 7.6% composability), their performance varies drastically by bridge entity type, achieving over 80% for 'country' bridge entities but only 5-6% for 'year' ones.
Researchers from Huawei Noah's Ark Lab, UCL, and TU Darmstadt developed Agent K, a generalist AI agent utilizing Kolb's experiential learning theory and Vygotsky's Zone of Proximal Development, to autonomously navigate and solve Kaggle data science challenges. The system achieved human-competitive performance, demonstrating an Elo-MMR of 1694 and earning the equivalent of 4 gold and 4 silver medals against human experts across diverse competition types.
Researchers from diverse institutions propose a unifying framework for representational alignment, a concept central to cognitive science, neuroscience, and machine learning. This framework provides a common language and systematically categorizes research objectives and methodological components, aiming to bridge disciplinary fragmentation.
Yang et al. investigate large language models' capacity for self-reevaluation by testing their ability to identify and recover from various injected "unhelpful thoughts." The research finds a substantial gap between thought recognition and effective recovery, particularly noting an inverse scaling trend where larger models demonstrate reduced robustness to certain unhelpful thought injections, impacting their reliability and safety.
Researchers from KAIST, UCL, and KT investigated how Large Language Models acquire and forget factual knowledge during pretraining, introducing metrics such as 'effectivity' and 'retainability'. They found that knowledge is gained through incremental 'micro-acquisitions' followed by power-law forgetting, that larger models acquire knowledge more effectively, and that varied exposure and larger batch sizes improve retention.
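A minimal sketch of that acquisition-then-forgetting pattern: each exposure adds a gain that decays as a power law in the steps since exposure. The gain and decay constants are illustrative assumptions, not values fitted in the paper:

```python
# Sketch: micro-acquisition at each exposure, power-law forgetting afterwards.
# gain and decay are illustrative constants, not the paper's fitted values.
def knowledge_level(step, exposures, gain=0.2, decay=0.5):
    """Sum of power-law-decayed gains from all exposures seen so far."""
    return sum(gain * (step - e + 1) ** (-decay) for e in exposures if e <= step)

exposures = [0, 100, 200, 300]  # spaced repetitions of one fact
for step in [50, 150, 350, 1000]:
    print(step, round(knowledge_level(step, exposures), 3))
```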
Researchers from UT Austin, UCL, and Google DeepMind adapted Latest Weight Averaging (LAWA) for large language model pre-training, achieving faster convergence and improved generalization by combining high learning rates with strategically averaged, distant model checkpoints. This approach substantially reduces GPU hours and consistently outperforms existing weight averaging methods across various LLM scales.
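A minimal sketch of the LAWA idea, assuming a sliding window of checkpoints sampled every fixed number of steps and averaged parameter-wise; the window size and sampling interval are illustrative choices:

```python
# Sketch of Latest Weight Averaging (LAWA): keep the k most recent sampled
# checkpoints and average their parameters. k and `interval` are
# illustrative choices, not the paper's tuned values.
import copy
from collections import deque
import torch

window = deque(maxlen=5)  # k = 5 most recent sampled checkpoints

def maybe_collect(model, step, interval=1000):
    """Snapshot the model's weights every `interval` training steps."""
    if step % interval == 0:
        window.append(copy.deepcopy(model.state_dict()))

def lawa_average(checkpoints):
    """Parameter-wise mean of a list of state_dicts."""
    avg = copy.deepcopy(checkpoints[0])
    for key in avg:
        avg[key] = torch.stack([c[key].float() for c in checkpoints]).mean(dim=0)
    return avg

# For evaluation: eval_model.load_state_dict(lawa_average(list(window)))
```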
This work introduces UncertaintyRAG, a lightweight and unsupervised retrieval model for long-context Retrieval-Augmented Generation (RAG). It leverages Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate semantic similarity between text chunks, improving robustness to distribution shifts. The model achieves state-of-the-art average performance on long-context QA and summarization benchmarks while using only 4% of the training data required by baseline models.
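A minimal sketch of one way an SNR-style span score could be computed from an LM's token log-probabilities (mean over standard deviation); this is an illustrative reading of "SNR-based span uncertainty", not the paper's exact estimator:

```python
# Illustrative SNR-style span uncertainty: mean token log-prob divided by its
# standard deviation. A guess at the flavor of the score, not UncertaintyRAG's
# exact estimator.
import numpy as np

def span_snr(token_logprobs):
    lp = np.asarray(token_logprobs, dtype=float)
    return lp.mean() / (lp.std() + 1e-8)

print(span_snr([-0.4, -0.6, -0.5, -0.7]))  # stable span -> larger |SNR|
print(span_snr([-2.1, -0.3, -3.0, -0.2]))  # erratic span -> smaller |SNR|
```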