A forward-facing review by Apollo Research and a large group of collaborators systematically outlines and categorizes the pressing unresolved challenges in mechanistic interpretability. This work aims to guide future research and accelerate progress towards understanding, controlling, and ensuring the safety of advanced AI systems.
Recovering meaningful concepts from language model activations is a central aim of interpretability. Existing feature extraction methods identify concepts as independent directions in activation space, but it is unclear whether this assumption can capture the rich temporal structure of language. Through a Bayesian lens, we show that Sparse Autoencoders (SAEs) impose priors assuming concepts are independent across time, which implies stationarity. Language model representations, however, exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with SAE priors. Taking inspiration from computational neuroscience, we introduce a new interpretability objective, Temporal Feature Analysis, whose temporal inductive bias decomposes the representation at a given time into two parts: a predictable component that can be inferred from the context, and a residual component that captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, whereas existing SAEs show significant failures on all of these tasks. Overall, our results underscore the need to match inductive biases to the data when designing robust interpretability tools.
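A minimal sketch of this kind of decomposition, assuming a ridge-regularized linear predictor over the previous activation stands in for the context model; the helper name `temporal_decompose` and all hyperparameters are illustrative and do not reproduce the paper's actual Temporal Feature Analysis objective.

```python
import torch

def temporal_decompose(acts: torch.Tensor, ridge: float = 1e-2):
    """Split activations into a context-predictable part and a residual.

    acts: (T, d) residual-stream activations for one sequence.
    Assumption: a linear map from x_{t-1} to x_t serves as the context model.
    """
    d = acts.shape[1]
    context = acts[:-1]                # x_1 .. x_{T-1}
    target = acts[1:]                  # x_2 .. x_T
    # Fit W minimizing ||context @ W - target||^2 + ridge * ||W||^2 (ridge regression).
    gram = context.T @ context + ridge * torch.eye(d)
    W = torch.linalg.solve(gram, context.T @ target)
    predictable = context @ W          # slow part, inferable from the context
    residual = target - predictable    # fast part, novel information
    return predictable, residual

if __name__ == "__main__":
    acts = torch.randn(128, 64)        # stand-in for real LM activations
    pred, res = temporal_decompose(acts)
    print(pred.shape, res.shape)       # both (127, 64)
```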
Researchers introduce SAEBench, a comprehensive benchmark that evaluates Sparse Autoencoders (SAEs) across eight diverse metrics for language model interpretability. The benchmark reveals that architectures such as Matryoshka SAEs excel at feature disentanglement and concept detection even though they do not always score best on traditional proxy metrics such as reconstruction fidelity.
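As a point of reference for what a traditional proxy metric measures, the sketch below scores a generic ReLU sparse autoencoder on reconstruction fidelity (fraction of variance explained) and mean L0; `TinySAE` and `fraction_variance_explained` are hypothetical stand-ins and do not use SAEBench's API or reproduce any of its eight metrics.

```python
import torch

class TinySAE(torch.nn.Module):
    """Generic ReLU sparse autoencoder: encode to a wide dictionary, decode back."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model)

    def forward(self, x):
        latents = torch.relu(self.enc(x))
        return self.dec(latents), latents

def fraction_variance_explained(x, x_hat):
    # 1 - (residual variance / total variance): a common reconstruction-fidelity proxy.
    resid_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    return 1.0 - (resid_var / total_var).item()

acts = torch.randn(1024, 64)            # stand-in activations, not real LM data
sae = TinySAE(d_model=64, d_dict=512)   # untrained, for illustration only
recon, latents = sae(acts)
print(f"FVE: {fraction_variance_explained(acts, recon):.3f}")
print(f"mean L0 (active latents per token): {(latents > 0).float().sum(-1).mean():.1f}")
```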
An empirical investigation by Durham University researchers challenges the assumption that sparse autoencoders discover canonical units of analysis in language models. Introducing novel analytical techniques, the authors demonstrate that the feature decomposition an SAE learns varies significantly with its size and configuration.
Researchers from LASR Labs introduce and characterize "feature absorption" in Sparse Autoencoders (SAEs), a phenomenon where a general feature fails to activate in the presence of a more specific, token-aligned feature due to sparsity optimization. Their analysis, including a novel detection metric, empirically confirms this limitation in hundreds of LLM SAEs, demonstrating that absorption rates increase with higher sparsity and wider architectures.
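A hedged illustration of that qualitative signature: given a linear probe for a concept, flag tokens where the probe says the concept is present, the general latent is silent, and a more specific latent whose decoder direction aligns with the probe fires instead. The function below is an illustrative heuristic with made-up thresholds, not the paper's actual detection metric.

```python
import torch

def absorption_rate(latents, concept_present, general_idx,
                    decoder, probe_dir, align_thresh=0.5):
    """Fraction of concept tokens where the general latent is silent but an
    aligned, more specific latent fires instead (simplified heuristic).

    latents: (N, d_dict) SAE activations; concept_present: (N,) bool probe labels;
    decoder: (d_dict, d_model) latent directions; probe_dir: (d_model,).
    """
    # Tokens where the concept is present but the general latent does not fire.
    missed = concept_present & (latents[:, general_idx] == 0)
    # Specific latents whose decoder direction aligns with the probe direction.
    dirs = torch.nn.functional.normalize(decoder, dim=1)
    probe = torch.nn.functional.normalize(probe_dir, dim=0)
    aligned = torch.nonzero(dirs @ probe > align_thresh).squeeze(-1)
    aligned = aligned[aligned != general_idx]
    # Missed tokens where some aligned specific latent is active instead.
    absorbed = missed & (latents[:, aligned] > 0).any(dim=1)
    return absorbed.sum().item() / max(concept_present.sum().item(), 1)

# Usage with random stand-in data; real use would supply SAE latents and a trained probe.
torch.manual_seed(0)
latents = torch.rand(200, 128) * (torch.rand(200, 128) > 0.9)
rate = absorption_rate(latents, torch.rand(200) > 0.5, general_idx=0,
                       decoder=torch.randn(128, 64), probe_dir=torch.randn(64))
print(f"estimated absorption rate: {rate:.2f}")
```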