Researchers from Goodfire AI and Anthropic demonstrated that mechanistic interpretability tools, specifically logit lens analysis, can decode ROT-13 encoded reasoning within a finetuned large language model without supervision. Their pipeline successfully reconstructed human-readable reasoning transcripts, demonstrating robustness to simple forms of internal textual obfuscation.
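The core idea behind the logit lens can be illustrated with a toy sketch: project an intermediate hidden state through the model's unembedding matrix to read off the vocabulary token it most strongly encodes, then apply ROT-13 to recover plain text. The dimensions, vocabulary, and unembedding matrix below are hypothetical stand-ins, not the authors' actual pipeline, assuming only standard NumPy and the built-in `codecs` ROT-13 codec.

```python
import numpy as np
import codecs

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 5
# Toy vocabulary mixing ROT-13 tokens and plain tokens (illustrative only)
vocab_tokens = ["gur", "the", "png", "cat", "qbt"]

# Hypothetical unembedding matrix mapping hidden states to vocab logits;
# columns are normalized so cosine similarity determines the argmax
W_U = rng.normal(size=(d_model, vocab_size))
W_U /= np.linalg.norm(W_U, axis=0)

# Suppose an intermediate layer's residual stream aligns with token "gur"
hidden = W_U[:, 0]

# Logit lens: project the hidden state into vocabulary space
logits = hidden @ W_U
decoded = vocab_tokens[int(np.argmax(logits))]

# Undo the ROT-13 obfuscation to recover readable text
plain = codecs.decode(decoded, "rot_13")
print(decoded, "->", plain)  # gur -> the
```

In the paper's setting the hidden states come from a finetuned transformer rather than a random matrix, but the projection-then-decode structure is the same.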