Transcript
John: Welcome to today's lecture in Mechanistic Interpretability of Neural Networks. Today's paper is "Base Models Know How to Reason, Thinking Models Learn When." We've seen a lot of recent work on making models more efficient, like in "AdaptThink" or "Thinkless," which focus on teaching models when to engage in costly reasoning. This work, coming from researchers at the University of Oxford and others, takes a more fundamental approach. It asks what the mechanistic difference between a base model and a fine-tuned reasoning model actually is. Yes, Noah?
Noah: Hi Professor. So, if the idea is that base models already have these skills, does the paper suggest that fine-tuning is just… inefficient? That we’re re-teaching things the model already has stored?
John: That’s a sharp way to put it. The authors would argue it's not about re-teaching, but about teaching a new, higher-level skill: orchestration. The core hypothesis is that base models already possess the fundamental reasoning components; what thinking models primarily learn is the "when", that is, the sequence in which to deploy those components correctly. It's the difference between having a set of tools and having a blueprint for how to use them to build something.
John: To test this hypothesis, they took a two-step approach. First, they had to identify what these reasoning "tools" even are. Instead of defining them top-down, they developed an unsupervised method to discover a taxonomy of reasoning mechanisms directly from the thinking models' activations. Second, they used this taxonomy to build what they call a hybrid model. This model uses the logic of a thinking model to steer a base model, essentially telling it which reasoning tool to use at each step of a problem.
Noah: An unsupervised taxonomy? How did they ensure the categories they found were actually meaningful cognitive steps and not just statistical artifacts of the text?
John: Excellent question. That’s the first major part of their methodology. They collected sentence-level activations from several thinking models working on reasoning problems. Then, they trained a Sparse Autoencoder, or SAE, on these activations. The key was that they heavily constrained the SAE's latent dimension, forcing it to learn a very compact set of fundamental features that represent distinct modes of reasoning.
Noah: Why sentence-level? And why a constrained SAE? Usually with SAEs in interpretability, we see much larger feature sets to capture everything.
John: They argue the sentence is the right level of abstraction for a reasoning step—more granular than a full paragraph but more meaningful than a single token. The constrained SAE is crucial. They weren't trying to find every single feature in the model; they wanted to find the core dimensions of variation related to the reasoning process. This forces the model to cluster behaviors into broader, more interpretable categories like 'state a hypothesis,' 'perform a calculation,' or 'check a condition.' They then validated these discovered categories using other LLMs to score them for interpretability, consistency, and independence.
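John: To make the setup concrete, here is a minimal sketch of what a constrained sparse autoencoder over sentence-level activations could look like. The dimensions, the ReLU-plus-L1 recipe, and the names here are my own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSAE(nn.Module):
    """Toy SAE with a deliberately small latent dimension, so each latent is
    forced to capture a broad, reusable mode of reasoning rather than a
    fine-grained token-level feature."""

    def __init__(self, d_model: int = 4096, d_latent: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model) sentence-level activations, e.g. the residual
        # stream averaged over the tokens of one reasoning sentence.
        z = F.relu(self.encoder(x))   # sparse, non-negative codes
        x_hat = self.decoder(z)       # reconstruction of the activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes each sentence to
    # activate only a few latents, i.e. a few reasoning categories.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()
```

John: Each surviving latent then becomes a candidate reasoning category, and those are what the other LLMs score for interpretability, consistency, and independence.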
Noah: Okay, so once they have this taxonomy, how do they actually use it to "steer" a base model? Is this like adding a bias to the activations?
John: Exactly. For each reasoning category from their taxonomy, they identify a corresponding "steering vector." This is a specific direction in the base model's activation space. During generation, their hybrid system has a classifier, effectively mimicking the thinking model's logic, that decides which reasoning step is needed next. It then adds the corresponding steering vector to the base model's activations for a few tokens, nudging its generation in the intended direction.
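John: Mechanically, the steering itself is just vector addition in the hidden states. Here is a rough sketch under illustrative assumptions: a table of pre-extracted unit directions per category, and a hypothetical `choose_category` classifier standing in for the thinking model's orchestration logic.

```python
import torch
from typing import Optional

D_MODEL = 4096  # illustrative hidden size

# Hypothetical: one direction per discovered reasoning category, assumed to
# have been extracted in advance from the base model's activation space.
steering_vectors = {
    "state_hypothesis": torch.randn(D_MODEL),
    "perform_calculation": torch.randn(D_MODEL),
    "check_condition": torch.randn(D_MODEL),
}

def steer(hidden: torch.Tensor, category: Optional[str], alpha: float = 4.0) -> torch.Tensor:
    """Nudge hidden states of shape (batch, seq, d_model) toward one reasoning
    category; if category is None, leave the base model untouched."""
    if category is None:
        return hidden
    direction = steering_vectors[category]
    direction = direction / direction.norm()
    return hidden + alpha * direction.to(hidden.dtype)

# Inside the generation loop, applied for the next few tokens:
#   category = choose_category(context)   # hypothetical classifier
#   hidden = steer(hidden, category)
```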
Noah: So the steering isn't always on? It's applied dynamically?
John: Precisely. And that's a critical point. Their results showed this intervention was needed, on average, for only about 12% of the tokens in a given problem. Yet, this sparse intervention was able to recover over 80%, and in one case up to 91%, of the performance gap between the base model and the thinking model on difficult math benchmarks like GSM8K and MATH500.
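John: And to be precise about what "recovering the gap" means: it's measured against the base-to-thinking accuracy difference, as in this sketch. The accuracies here are made-up placeholders purely to show the arithmetic, not the paper's reported numbers.

```python
def gap_recovered(acc_base: float, acc_hybrid: float, acc_thinking: float) -> float:
    """Fraction of the base-to-thinking performance gap closed by the hybrid model."""
    return (acc_hybrid - acc_base) / (acc_thinking - acc_base)

# Illustrative placeholder accuracies, not reported results:
print(gap_recovered(acc_base=0.40, acc_hybrid=0.72, acc_thinking=0.80))  # -> 0.8
```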
Noah: That's a very significant recovery. But couldn't just adding any directional noise or a general 'reasoning' bias achieve a similar effect? How do we know it's the specific vectors and the timing that matter?
John: They tested that directly with ablation studies. Using only a general bias vector helped a little but was far less effective. Randomly firing the category vectors or using random unit vectors performed poorly. This provides strong evidence that it's the combination of the right vector at the right time that produces the effect. This finding helps reframe our understanding of methods like Chain-of-Thought prompting. It suggests CoT isn't just giving the model more space to think; it's providing an external structure that helps the model orchestrate these latent abilities it already possesses.
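John: Those controls are easy to picture as swaps of either the direction or the timing. Here is a sketch of the kinds of baselines involved, with function names of my own choosing:

```python
import random
import torch
import torch.nn.functional as F

def general_bias(vectors: dict) -> torch.Tensor:
    """Control 1: collapse all category directions into one generic 'reasoning' bias."""
    mean = torch.stack(list(vectors.values())).mean(dim=0)
    return mean / mean.norm()

def random_directions(vectors: dict) -> dict:
    """Control 2: keep the timing but replace each direction with a random unit vector."""
    return {k: F.normalize(torch.randn_like(v), dim=0) for k, v in vectors.items()}

def shuffled_timing(schedule: list) -> list:
    """Control 3: keep the real category vectors but fire them at random times."""
    shuffled = list(schedule)
    random.shuffle(shuffled)
    return shuffled
```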
Noah: This seems to challenge conclusions from papers like "Implicit Reasoning in Transformers is Reasoning through Shortcuts," which suggest that base models don't really do robust step-by-step computation. Is this evidence that they can, they just need explicit guidance?
John: It certainly points in that direction. It suggests the 'shortcuts' might be a default behavior when the model lacks a clear plan. By providing that plan via steering, you can elicit a more robust, step-by-step process. The major implication here is a potential new view of training: pre-training is for acquiring fundamental mechanisms, while post-training is for learning deployment strategies. This could lead to far more efficient training methods focused on teaching orchestration rather than re-teaching capabilities.
John: So, the central takeaway is this clean decomposition. Base models appear to know how to perform fundamental reasoning steps. The primary value added by specialized fine-tuning is teaching them when to execute each step in a coherent sequence. It shifts our focus from simply building bigger models to understanding and controlling the cognitive-like processes happening within them.
John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.