A Survey of Reinforcement Learning for Large Reasoning Models

BibTeX
@misc{zhang2025surveyreinforcementlearning,
      title={A Survey of Reinforcement Learning for Large Reasoning Models},
      author={Kaiyan Zhang and Yuxin Zuo and Bingxiang He and Youbang Sun and Runze Liu and Che Jiang and Yuchen Fan and Kai Tian and Guoli Jia and Pengfei Li and Yu Fu and Xingtai Lv and Yuchen Zhang and Sihang Zeng and Shang Qu and Haozhan Li and Shijie Wang and Yuru Wang and Xinwei Long and Fangfu Liu and Xiang Xu and Jiaze Ma and Xuekai Zhu and Ermo Hua and Yihao Liu and Zonglin Li and Huayu Chen and Xiaoye Qu and Yafu Li and Weize Chen and Zhenzhao Yuan and Junqi Gao and Dong Li and Zhiyuan Ma and Ganqu Cui and Zhiyuan Liu and Biqing Qi and Ning Ding and Bowen Zhou},
      year={2025},
      eprint={2509.08827},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.08827},
}
GitHub
Awesome-RL-for-LRMs: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
AI Audio Lecture + Q&A
Transcript
John: Welcome to our course on Advanced Topics in Large Language Models. Today's lecture is on a comprehensive new paper, 'A Survey of Reinforcement Learning for Large Reasoning Models'. We've seen a lot of recent work, such as 'A Survey of Frontiers in LLM Reasoning', focusing on agentic systems, and this paper fits into that trend. It's a large collaboration led by researchers at Tsinghua University and Shanghai AI Laboratory, and it argues that the field is moving beyond using RL for simple alignment.

John: Yes, Noah?

Noah: Hi Professor. When you say 'beyond simple alignment', what's the key distinction they're making? I thought methods like RLHF were already about improving model capabilities.

John: A fair question. RLHF primarily shapes a model's behavior to be helpful, harmless, and honest, based on human preferences. This survey argues for a shift towards using RL to incentivize the reasoning process itself. They frame this as Reinforcement Learning with Verifiable Rewards, or RLVR. This is about improving the model's intrinsic ability to solve complex, logical problems, not just its conversational style.

Noah: So, what are these verifiable rewards? Are we just replacing human labelers with some kind of automated system?

John: Essentially, yes. The core idea is to use rewards that are objective and can be checked automatically at scale. Think of a model solving a math problem. The reward isn't a human saying 'that looks good'; it's a binary signal of whether the final answer is correct. For a coding task, the reward could be whether the generated code passes a set of unit tests. This automatable feedback loop is what allows for massive scaling, which is a central theme of the paper. They call this the 'Verifier's Law': efficient RL depends on automatable verification.

Noah: That makes sense for math and code, but what about tasks that aren't so black and white?

John: That's a key challenge they address. For subjective tasks, the survey discusses the rise of Generative Reward Models, or GenRMs. Instead of outputting a simple score, these models provide nuanced, text-based feedback, almost like a peer reviewer. A recent trend is to train these reward models to first reason through a problem or use a rubric before delivering a judgment. This makes the reward signal itself more structured and interpretable. They also discuss unsupervised rewards derived from model consistency or self-generated knowledge to bypass annotation bottlenecks entirely.

Noah: So on the technical side, what are the most critical components they identify for making this work?

John: The paper dissects the process into three main pillars: reward design, which we just touched on; policy optimization; and sampling strategy. For policy optimization, a key finding is the preference for critic-free algorithms in verifiable tasks. While PPO with a critic is common, it adds a lot of computational overhead, and the critic can be a point of failure. Simpler, critic-free methods like GRPO are becoming more popular because they only need sequence-level rewards, such as 'the final answer was correct', which simplifies training.

Noah: Wait, wouldn't a critic-free method struggle with credit assignment on long reasoning chains?

John: It can, which is why reward design and sampling are so important; they compensate for that. The survey points to the use of dense rewards, providing feedback at the token or step level, to improve credit assignment. The toy sketch below shows how a verifiable reward and GRPO-style group advantages fit together.
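(Reference sketch accompanying this part of the lecture: a minimal Python illustration of a binary, automatically checkable reward combined with GRPO-style group-normalized advantages. The function names, the "Answer: <value>" extraction convention, and the sample completions are illustrative assumptions, not code from the survey.)

# Minimal sketch: a verifiable reward plus GRPO-style group advantages.
# Everything here is illustrative; the names and the answer format are assumptions.

import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary, automatically checkable reward: 1.0 if the final answer matches."""
    # Assume the model ends its reasoning with "Answer: <value>".
    final = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sequence-level reward against the
    mean and standard deviation of its own rollout group (no learned critic)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # all rollouts equally good or bad: no signal
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for the same prompt, scored by the verifier.
    completions = [
        "2 + 2 = 4. Answer: 4",
        "I think it's 5. Answer: 5",
        "Adding gives 4. Answer: 4",
        "Answer: 22",
    ]
    rewards = [verifiable_reward(c, "4") for c in completions]
    print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
    print(group_advantages(rewards))  # positive for correct rollouts, negative otherwise

In practice the normalized advantages would weight the policy-gradient update for each sampled sequence; the point of the sketch is only that no value network is needed when the reward is a checkable outcome.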
John: For sampling, instead of just generating linear sequences, they review structured methods like tree-based rollouts. This aligns the generation process with the problem-solving structure, making the final outcome-based reward more informative about which steps were valuable (a toy tree-rollout sketch follows the transcript). This is how RL is being applied to improve everything from code generation on benchmarks like SWE-Bench to agentic tasks like web browsing, and even to controlling robots.

Noah: So, what are the broader implications of this RLVR paradigm? Does this challenge the idea that we just need bigger models and more pre-training data?

John: That's exactly the point. The authors propose that RLVR introduces a 'new scaling axis' for LLM capabilities, complementing the traditional scaling laws based on data and parameter count. This connects to findings in other papers, like NVIDIA's 'ProRL', which showed that prolonged reinforcement learning could expand a model's reasoning abilities beyond its pre-training. The survey frames it as a debate between 'sharpening' existing knowledge and 'discovering' new skills. While the jury is still out, the evidence suggests that with stable RL, models can learn to compose existing skills in novel ways, which feels a lot like discovery.

Noah: So you're saying this could be a more efficient path to more capable models?

John: Potentially. It suggests a path where we can continuously improve a model's reasoning after pre-training, using self-generated data and automated feedback. That is a very powerful concept. The paper sets out a roadmap toward more autonomous, capable systems by focusing on challenges in continual RL, model-based RL, and even using RL during pre-training itself, which would be a significant departure from current practice.

John: To wrap up, this survey provides a comprehensive framework for understanding a major shift in the field. It clarifies how reinforcement learning is evolving from a tool for behavioral alignment into a core engine for enhancing logical reasoning. The central takeaway is that by combining RL with verifiable, automated rewards, we unlock a new, scalable pathway for improving model intelligence. This is a critical direction as we move from Large Language Models to what this paper calls Large Reasoning Models.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
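(Reference sketch for the tree-based rollouts mentioned in the lecture: partial reasoning chains are expanded as a tree, leaves are scored by an outcome verifier, and leaf rewards are averaged back up so earlier steps receive denser credit. The Node structure, propose_steps, and outcome_reward below are hypothetical stand-ins for a real policy and verifier, not the survey's implementation.)

# Minimal sketch of a tree-structured rollout with backed-up outcome rewards.
# All names and the toy verifier are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class Node:
    steps: list[str]                                    # reasoning prefix so far
    children: list["Node"] = field(default_factory=list)
    value: float = 0.0                                  # mean outcome reward of descendants


def propose_steps(prefix: list[str], branching: int) -> list[str]:
    """Stand-in for sampling `branching` candidate next steps from the policy."""
    return [f"step{len(prefix) + 1}-option{i}" for i in range(branching)]


def is_terminal(prefix: list[str], depth: int) -> bool:
    return len(prefix) >= depth


def outcome_reward(prefix: list[str]) -> float:
    """Stand-in verifier: rewards leaves whose last step took the first branch."""
    return 1.0 if prefix and prefix[-1].endswith("option0") else 0.0


def expand(node: Node, branching: int, depth: int) -> float:
    """Grow the tree and back up the mean leaf reward as each node's value."""
    if is_terminal(node.steps, depth):
        node.value = outcome_reward(node.steps)
        return node.value
    for step in propose_steps(node.steps, branching):
        child = Node(steps=node.steps + [step])
        node.children.append(child)
        expand(child, branching, depth)
    node.value = sum(c.value for c in node.children) / len(node.children)
    return node.value


if __name__ == "__main__":
    root = Node(steps=[])
    expand(root, branching=2, depth=3)
    # Each child's value now estimates how promising that first step was,
    # giving denser, step-level credit than a single end-of-sequence reward.
    for child in root.children:
        print(child.steps, round(child.value, 2))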