Researchers from Goodfire AI and Anthropic demonstrated that mechanistic interpretability tools, specifically logit lens analysis, can decode ROT-13 encoded reasoning within a finetuned large language model without supervision. Their pipeline successfully reconstructed human-readable reasoning transcripts, demonstrating robustness to simple forms of internal textual obfuscation.
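The core idea behind the logit lens can be illustrated with a toy sketch: project an intermediate hidden state through the model's unembedding matrix to read off the vocabulary token it most strongly encodes, then apply ROT-13 to recover plain text. The dimensions, vocabulary, and unembedding matrix below are hypothetical stand-ins, not the authors' actual pipeline, assuming only standard NumPy and the built-in `codecs` ROT-13 codec.

```python
import numpy as np
import codecs

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 5
# Toy vocabulary mixing ROT-13 tokens and plain tokens (illustrative only)
vocab_tokens = ["gur", "the", "png", "cat", "qbt"]

# Hypothetical unembedding matrix mapping hidden states to vocab logits;
# columns are normalized so cosine similarity determines the argmax
W_U = rng.normal(size=(d_model, vocab_size))
W_U /= np.linalg.norm(W_U, axis=0)

# Suppose an intermediate layer's residual stream aligns with token "gur"
hidden = W_U[:, 0]

# Logit lens: project the hidden state into vocabulary space
logits = hidden @ W_U
decoded = vocab_tokens[int(np.argmax(logits))]

# Undo the ROT-13 obfuscation to recover readable text
plain = codecs.decode(decoded, "rot_13")
print(decoded, "->", plain)  # gur -> the
```

In the paper's setting the hidden states come from a finetuned transformer rather than a random matrix, but the projection-then-decode structure is the same.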