A forward-facing review by Apollo Research and a large group of collaborators systematically outlines and categorizes the pressing unresolved challenges in mechanistic interpretability. This work aims to guide future research and accelerate progress towards understanding, controlling, and ensuring the safety of advanced AI systems.
Recovering meaningful concepts from language model activations is a central aim of interpretability. Existing feature extraction methods identify concepts as independent directions in activation space, but it is unclear whether this assumption can capture the rich temporal structure of language. Through a Bayesian lens, we show that Sparse Autoencoders (SAEs) impose priors assuming concepts are independent across time, which implies stationarity. Language model representations, however, exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with SAE priors. Taking inspiration from computational neuroscience, we introduce a new interpretability objective, Temporal Feature Analysis, whose temporal inductive bias decomposes the representation at a given time into two parts: a predictable component that can be inferred from the context, and a residual component that captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, whereas existing SAEs show significant failures on all of these tasks. Overall, our results underscore the need to match inductive biases to the data when designing robust interpretability tools.
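A minimal sketch of this kind of decomposition, assuming a ridge-regularized linear predictor over the previous activation stands in for the context model; the helper name `temporal_decompose` and all hyperparameters are illustrative and do not reproduce the paper's actual Temporal Feature Analysis objective.

```python
import torch

def temporal_decompose(acts: torch.Tensor, ridge: float = 1e-2):
    """Split activations into a context-predictable part and a residual.

    acts: (T, d) residual-stream activations for one sequence.
    Assumption: a linear map from x_{t-1} to x_t serves as the context model.
    """
    d = acts.shape[1]
    context = acts[:-1]                # x_1 .. x_{T-1}
    target = acts[1:]                  # x_2 .. x_T
    # Fit W minimizing ||context @ W - target||^2 + ridge * ||W||^2 (ridge regression).
    gram = context.T @ context + ridge * torch.eye(d)
    W = torch.linalg.solve(gram, context.T @ target)
    predictable = context @ W          # slow part, inferable from the context
    residual = target - predictable    # fast part, novel information
    return predictable, residual

if __name__ == "__main__":
    acts = torch.randn(128, 64)        # stand-in for real LM activations
    pred, res = temporal_decompose(acts)
    print(pred.shape, res.shape)       # both (127, 64)
```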
Researchers introduce SAEBench, a comprehensive benchmark that evaluates Sparse Autoencoders (SAEs) across eight diverse metrics for language model interpretability. The benchmark reveals that architectures such as Matryoshka SAEs excel at feature disentanglement and concept detection even though they do not always score best on traditional proxy metrics such as reconstruction fidelity.
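As a point of reference for what a traditional proxy metric measures, the sketch below scores a generic ReLU sparse autoencoder on reconstruction fidelity (fraction of variance explained) and mean L0; `TinySAE` and `fraction_variance_explained` are hypothetical stand-ins and do not use SAEBench's API or reproduce any of its eight metrics.

```python
import torch

class TinySAE(torch.nn.Module):
    """Generic ReLU sparse autoencoder: encode to a wide dictionary, decode back."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model)

    def forward(self, x):
        latents = torch.relu(self.enc(x))
        return self.dec(latents), latents

def fraction_variance_explained(x, x_hat):
    # 1 - (residual variance / total variance): a common reconstruction-fidelity proxy.
    resid_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    return 1.0 - (resid_var / total_var).item()

acts = torch.randn(1024, 64)            # stand-in activations, not real LM data
sae = TinySAE(d_model=64, d_dict=512)   # untrained, for illustration only
recon, latents = sae(acts)
print(f"FVE: {fraction_variance_explained(acts, recon):.3f}")
print(f"mean L0 (active latents per token): {(latents > 0).float().sum(-1).mean():.1f}")
```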
An empirical investigation by Durham University researchers challenges the assumption that sparse autoencoders discover canonical units of analysis in language models. Introducing novel analytical techniques, the authors demonstrate that the feature decomposition an SAE learns varies significantly with its size and configuration.
Researchers from LASR Labs introduce and characterize "feature absorption" in Sparse Autoencoders (SAEs), a phenomenon where a general feature fails to activate in the presence of a more specific, token-aligned feature due to sparsity optimization. Their analysis, including a novel detection metric, empirically confirms this limitation in hundreds of LLM SAEs, demonstrating that absorption rates increase with higher sparsity and wider architectures.
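A hedged illustration of that qualitative signature: given a linear probe for a concept, flag tokens where the probe says the concept is present, the general latent is silent, and a more specific latent whose decoder direction aligns with the probe fires instead. The function below is an illustrative heuristic with made-up thresholds, not the paper's actual detection metric.

```python
import torch

def absorption_rate(latents, concept_present, general_idx,
                    decoder, probe_dir, align_thresh=0.5):
    """Fraction of concept tokens where the general latent is silent but an
    aligned, more specific latent fires instead (simplified heuristic).

    latents: (N, d_dict) SAE activations; concept_present: (N,) bool probe labels;
    decoder: (d_dict, d_model) latent directions; probe_dir: (d_model,).
    """
    # Tokens where the concept is present but the general latent does not fire.
    missed = concept_present & (latents[:, general_idx] == 0)
    # Specific latents whose decoder direction aligns with the probe direction.
    dirs = torch.nn.functional.normalize(decoder, dim=1)
    probe = torch.nn.functional.normalize(probe_dir, dim=0)
    aligned = torch.nonzero(dirs @ probe > align_thresh).squeeze(-1)
    aligned = aligned[aligned != general_idx]
    # Missed tokens where some aligned specific latent is active instead.
    absorbed = missed & (latents[:, aligned] > 0).any(dim=1)
    return absorbed.sum().item() / max(concept_present.sum().item(), 1)

# Usage with random stand-in data; real use would supply SAE latents and a trained probe.
torch.manual_seed(0)
latents = torch.rand(200, 128) * (torch.rand(200, 128) > 0.9)
rate = absorption_rate(latents, torch.rand(200) > 0.5, general_idx=0,
                       decoder=torch.randn(128, 64), probe_dir=torch.randn(64))
print(f"estimated absorption rate: {rate:.2f}")
```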