Transcript
John: In our seminar on Advanced Language Model Architectures, we've discussed the ongoing debate about what reinforcement learning actually does to a model's reasoning abilities. We've seen papers like 'Echo Chamber' from Harvard suggesting RL mainly amplifies pre-trained behaviors. Today's lecture is on 'On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models,' a paper from Carnegie Mellon University that tries to untangle this very question.
John: The authors argue that the field lacks clarity because we don't have enough control over the training pipeline. This work introduces a fully controlled framework to isolate the causal effects of each stage. Yes, Noah?
Noah: Hi Professor. So, is the core idea that if you can't control what the model sees during pre-training, you can't really claim that RL is teaching it something new versus just refining what it already knew from some obscure corner of the internet?
John: Precisely. The main contribution here is creating a synthetic, controllable world for the model to learn in. This allows them to precisely measure what each training phase—pre-training, mid-training, and RL—actually contributes. Their goal is to resolve the ambiguity around RL's effectiveness by asking: what is the interplay between these stages in shaping reasoning? To do this, they evaluate models on two axes. First, 'extrapolative generalization,' which is the ability to solve problems deeper or more complex than anything seen in training. Think of it as vertical scaling of reasoning. Second, 'contextual generalization,' which is the ability to apply the same logical steps to a completely new domain or context—a horizontal transfer of skills.
Noah: And this synthetic world, is it just abstract logic problems, or does it try to mimic natural language?
John: It's a clever balance. They use a framework based on GSM-Infinite, where each problem is a directed acyclic graph of simple arithmetic operations. This gives them precise control over logical complexity. But then they use templates—like 'animals in a zoo' or 'teachers in a school'—to render these logical graphs into diverse, natural-language word problems. This allows them to control the underlying logic and the surface-level context independently, which is crucial for testing that contextual generalization.
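John: To make that concrete, here is a rough Python sketch of the idea. The function names, number ranges, and templates are my own illustration rather than the authors' actual generator, but it shows how the logical graph and the surface context can be varied independently.

```python
# Purely illustrative sketch (names, ranges, and templates are assumptions,
# not the authors' code): a problem is a DAG of arithmetic operations whose
# size sets the logical complexity, while a separate surface template sets
# the context the problem is phrased in.
import random

OPS = {"add": lambda a, b: a + b,
       "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def sample_problem(num_ops, seed=0):
    """Node 0 is a constant; each later node applies an op to one earlier node."""
    rng = random.Random(seed)
    nodes = [("const", None, rng.randint(2, 9))]
    for i in range(1, num_ops + 1):
        op = rng.choice(list(OPS))
        parent = rng.randrange(i)                      # edge to an earlier node
        const = rng.randint(2, 9)
        nodes.append((op, (parent, const), OPS[op](nodes[parent][2], const)))
    return nodes

def render(nodes, context):
    """Render the same logical graph as a word problem in a chosen context."""
    entity = {"zoo": "animals in enclosure {i}",
              "school": "students in classroom {i}"}[context]
    phrases = {"add": "{c} more {x} than {y}",
               "sub": "{c} fewer {x} than {y}",
               "mul": "{c} times as many {x} as {y}"}
    name = lambda i: entity.format(i=i)
    lines = [f"There are {nodes[0][2]} {name(0)}."]
    for i, (op, (parent, const), _) in enumerate(nodes[1:], start=1):
        lines.append("There are " +
                     phrases[op].format(c=const, x=name(i), y=name(parent)) + ".")
    lines.append(f"How many {name(len(nodes) - 1)} are there?")
    return " ".join(lines)

problem = sample_problem(num_ops=4)
print(render(problem, "zoo"))       # same logic, zoo surface form
print(render(problem, "school"))    # same logic, school surface form
print("ground-truth answer:", problem[-1][2])
```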
Noah: Okay, that makes sense. But with synthetic data, aren't you worried the model just overfits to the specific templates and structures you've designed?
John: That's a valid concern, and they address it through their evaluation. They train the model on problems up to a certain complexity, say 10 operations, and then test it on problems with 15 or 20 operations. This directly tests for extrapolation beyond the training data. For contextual generalization, they train on 'zoo' problems and test on 'school' problems to see if the reasoning transfers. The key is that the training, mid-training, and test sets are all disjoint, so there's no data contamination.
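John: In code, those splits might look something like this, reusing the toy generator from before. The exact depths, sample counts, and seed scheme are placeholders rather than the paper's configuration; the point is only that the three sets stay disjoint and differ along one axis at a time.

```python
# Illustrative splits: extrapolative generalization probes deeper graphs than
# anything in training, contextual generalization probes an unseen context.
def make_split(depths, context, seed_base, n_per_depth=100):
    split = []
    for d in depths:
        for s in range(n_per_depth):
            p = sample_problem(d, seed=seed_base + d * 1000 + s)
            split.append((render(p, context), p[-1][2]))   # (word problem, answer)
    return split
    # in practice one would also deduplicate across splits to guarantee disjointness

train_set          = make_split(range(2, 11), "zoo",    seed_base=0)        # up to 10 ops
extrapolation_test = make_split([15, 20],     "zoo",    seed_base=100_000)  # deeper, same context
contextual_test    = make_split(range(2, 11), "school", seed_base=200_000)  # same depths, new context
```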
Noah: How do they prevent reward hacking? It seems easy for a model to guess the right final answer without following the correct steps, especially in arithmetic.
John: This is one of the most critical parts of their methodology. Instead of just rewarding the correct final answer, they use 'process-level rewards'. The model has to output its entire reasoning trace, which is then parsed back into a dependency graph. This predicted graph is compared to the ground-truth graph. A full reward is only given if every single intermediate step is correct and the final answer is right. They found that adding this process-verification reward significantly mitigates reward hacking and improves the fidelity of the reasoning itself, leading to better generalization.
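John: Here is a minimal sketch of what such a process-level reward could look like for our toy problems. The trace format and the all-or-nothing scoring are my simplification of what the authors describe, not their actual parser or reward code.

```python
# Minimal sketch of a process-level reward for the toy problems above:
# the reasoning trace is parsed back into (node index, value) steps and
# checked against the ground-truth graph; full reward requires every
# intermediate step and the final answer to be correct.
import re

def parse_trace(trace):
    """Assume the model writes each step as, e.g., 'enclosure 3 = 12'."""
    return {int(i): int(v)
            for i, v in re.findall(r"enclosure (\d+) = (-?\d+)", trace)}

def process_reward(trace, nodes):
    predicted = parse_trace(trace)
    # ground-truth values for every derived (non-constant) node
    truth = {i: value for i, (op, _, value) in enumerate(nodes) if op != "const"}
    final_ok = predicted.get(len(nodes) - 1) == nodes[-1][2]
    steps_ok = all(predicted.get(i) == v for i, v in truth.items())
    # all-or-nothing here; whether partial credit is given is not specified
    return 1.0 if (final_ok and steps_ok) else 0.0
```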
Noah: And what was the role of this 'mid-training' stage they emphasize?
John: They found mid-training to be a powerful, often overlooked, lever. It's essentially a supervised fine-tuning phase on higher-quality data that bridges the gap between broad pre-training and the more targeted RL phase. By using mid-training to solidify skills on problems at the 'edge of the model's competence'—things it can almost do but not perfectly—they establish stronger reasoning priors. RL can then build upon this stronger foundation more effectively, leading to better performance with the same overall compute budget.
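John: As a hedged illustration, selecting that 'edge of competence' data could look something like the following. The pass-rate thresholds, the `model.solve` interface, and the usage line are hypothetical placeholders, not details from the paper.

```python
# Sketch of 'edge of competence' selection for mid-training data: keep
# problems the current model solves sometimes, but not reliably.
def select_edge_of_competence(candidate_pool, model, n_samples=8,
                              low=0.125, high=0.75):
    """`model.solve(problem_text) -> answer` is a hypothetical interface;
    the thresholds are placeholders, not the paper's values."""
    selected = []
    for problem_text, answer in candidate_pool:
        solved = sum(model.solve(problem_text) == answer for _ in range(n_samples))
        pass_rate = solved / n_samples
        if low <= pass_rate <= high:        # drop trivial and hopeless problems
            selected.append((problem_text, answer))
    return selected

# e.g. mid_training_set = select_edge_of_competence(
#          make_split(range(8, 13), "zoo", seed_base=300_000), model)
```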
John: So, how does this research shift the field? It moves the conversation from 'Does RL work?' to 'Under what conditions does RL work?'. It reconciles conflicting results by showing RL isn't a magic bullet. It only provides true capability gains when the model has 'headroom' to improve and the RL data is carefully curated to its edge of competence. This explains why some studies, like 'ProRL', see expansion, while others see mere amplification. It depends entirely on the setup.
Noah: So the takeaway for someone building a model is that you can't just throw a generic RL algorithm at a pre-trained model and expect new reasoning skills to emerge. The curriculum matters enormously.
John: Exactly. You need a curriculum. Pre-training should provide broad, even if minimal, exposure to the building blocks of reasoning in many contexts. Mid-training should sharpen those skills to a point of near-competence. And finally, RL should be used to push the model just beyond that frontier. It's about designing a synergistic pipeline, not just stacking independent stages on top of each other. This controlled, causal approach provides a much clearer roadmap for that.
John: The main takeaway here is that building reasoning models requires a deliberate, methodical approach to curriculum design. The effectiveness of any training stage is highly conditional on the stages that came before it. By systematically controlling the entire data pipeline, this work provides a principled guide for how to orchestrate pre-training, mid-training, and RL to build models that can generalize their reasoning abilities in both depth and breadth.
John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.