MATS Research

04 Jun 2025

computer-science computation-and-language machine-learning

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Anthropic, Decode Research, UK AI Security Institute, MATS Research

Researchers introduced SAEBench, a comprehensive benchmark for Sparse Autoencoders (SAEs) that evaluates their performance across eight diverse metrics for language model interpretability. The benchmark revealed that architectures like Matryoshka SAEs excel in feature disentanglement and concept detection despite not always optimizing traditional proxy metrics like reconstruction fidelity.

Anthropic, University College London, Decode Research, MATS Research, Cambridge Consultants

Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: this http URL
