LIPN (Sorbonne Paris Nord)
Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

ClassifSAE, a novel supervised Sparse Autoencoder, extracts influential and interpretable concepts from Large Language Models fine-tuned for text classification. The method consistently outperforms existing baselines on causality and interpretability metrics: ablating the extracted concepts reduces accuracy by up to 30.90% for Pythia-1B on AG News, and it achieves higher `ConceptSim` scores, ranging from 0.1309 to 0.1377.
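The sketch below illustrates the general shape of a supervised sparse autoencoder and of concept ablation, assuming a PyTorch implementation. The module layout, loss weights, and the exact form of the classification supervision are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of a supervised sparse autoencoder in the spirit of
# ClassifSAE. Names (hidden_dim, n_concepts, the classifier head, the loss
# coefficients) are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedSAE(nn.Module):
    def __init__(self, hidden_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, n_concepts)    # activations -> concept code
        self.decoder = nn.Linear(n_concepts, hidden_dim)    # concept code -> reconstruction
        self.classifier = nn.Linear(n_concepts, n_classes)  # supervision head

    def forward(self, h: torch.Tensor):
        z = F.relu(self.encoder(h))   # non-negative, sparse concept activations
        h_hat = self.decoder(z)       # reconstruct the LLM's hidden activation
        logits = self.classifier(z)   # predict the task label from concepts alone
        return z, h_hat, logits

def sae_loss(h, h_hat, z, logits, labels, l1_coef=1e-3, cls_coef=1.0):
    recon = F.mse_loss(h_hat, h)           # stay faithful to the original activation
    sparsity = z.abs().mean()              # L1 penalty -> few active concepts per input
    cls = F.cross_entropy(logits, labels)  # ties concepts to the classification task
    return recon + l1_coef * sparsity + cls_coef * cls

@torch.no_grad()
def ablate_concept(sae: SupervisedSAE, h: torch.Tensor, concept_idx: int):
    # Concept ablation: zero one latent and re-decode. Feeding the edited
    # activation back through the classifier measures that concept's causal
    # influence, e.g. as the accuracy drop reported above.
    z = F.relu(sae.encoder(h))
    z[:, concept_idx] = 0.0
    return sae.decoder(z)
```

The supervision term is what distinguishes this setup from a standard sparse autoencoder: the concept code must simultaneously reconstruct the activation and predict the label, which pushes the latents toward task-relevant, ablatable directions.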

Predicting memorization within Large Language Models fine-tuned for classification
Large Language Models have received significant attention due to their ability to solve a wide range of complex tasks. However, these models memorize a significant proportion of their training data, posing a serious privacy threat if that data is disclosed at inference time. To mitigate this unintended memorization, it is crucial to understand which elements are memorized and why. This area of research is largely unexplored, with most existing works providing a posteriori explanations. To address this gap, we propose a new approach to detect memorized samples a priori in LLMs fine-tuned for classification tasks. The method is effective from the early stages of training and readily adaptable to other classification settings, such as training vision models from scratch. It is supported by new theoretical results and requires a low computational budget. We achieve strong empirical results, paving the way for the systematic identification and protection of vulnerable samples before they are memorized.
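The abstract does not detail the detector itself. As a generic illustration of what an a priori, training-time signal can look like, the sketch below tracks per-example loss across early training steps, a common proxy in the memorization literature; it is explicitly not the paper's method, and all names (`LossTracker`, the threshold) are hypothetical.

```python
# Hypothetical illustration: per-example loss trajectories during early
# fine-tuning as an a priori memorization signal. This is a generic proxy
# from the memorization literature, NOT the detector proposed in the paper.
import torch
import torch.nn.functional as F

def per_example_losses(model, inputs, labels):
    """Unreduced cross-entropy, so each sample keeps its own loss value."""
    logits = model(inputs)
    return F.cross_entropy(logits, labels, reduction="none")

class LossTracker:
    """Accumulates each sample's loss trajectory across training steps."""
    def __init__(self):
        self.history = {}  # sample_id -> list of per-step losses

    def update(self, sample_ids, losses: torch.Tensor):
        for sid, loss in zip(sample_ids, losses.tolist()):
            self.history.setdefault(sid, []).append(loss)

    def flag_candidates(self, threshold: float = 0.05):
        # Samples whose loss collapses to near zero very early in training
        # are typical candidates for memorization rather than generalization.
        return [sid for sid, traj in self.history.items()
                if len(traj) >= 2 and traj[-1] < threshold]
```

Because the tracker only needs the unreduced losses already computed during training, a signal of this kind adds negligible overhead, consistent with the low computational budget the abstract emphasizes.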