Bristol AI Safety Centre
Sparse Autoencoders Find Highly Interpretable Features in Language Models

Researchers developed Sparse Autoencoders (SAEs) to extract highly interpretable, monosemantic features from the internal activations of pre-trained language models, demonstrating superior interpretability to alternative decomposition methods and enabling more precise causal localization of model behaviors.
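
The core idea can be sketched as dictionary learning on cached activations: an overcomplete linear autoencoder is trained to reconstruct activation vectors under an L1 sparsity penalty, so each learned feature direction fires on a narrow, interpretable pattern. Below is a minimal, hypothetical PyTorch sketch; the dictionary size, L1 coefficient, and training loop are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sparse autoencoder (SAE) sketch for language-model activations.
# Assumptions: PyTorch, random stand-in activations, illustrative hyperparameters.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, act_dim: int, dict_size: int):
        super().__init__()
        # Overcomplete dictionary: dict_size > act_dim
        self.encoder = nn.Linear(act_dim, dict_size)
        self.decoder = nn.Linear(dict_size, act_dim)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; sparsity comes from the L1 penalty below
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus L1 sparsity penalty on feature activations
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity


if __name__ == "__main__":
    act_dim, dict_size = 512, 4096
    sae = SparseAutoencoder(act_dim, dict_size)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

    # Stand-in for activations cached from a pre-trained language model
    activations = torch.randn(1024, act_dim)
    for _ in range(100):
        recon, feats = sae(activations)
        loss = sae_loss(activations, recon, feats)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After training, each row of the decoder weight matrix serves as a candidate feature direction whose top-activating inputs can be inspected for interpretability.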
