Transcript
John: In our seminar on Advanced Methods in Reinforcement Learning, we've seen a lot of recent work trying to make long-chain-of-thought reasoning more efficient. Papers like 'ThinkPrune' and 'AdaptThink' focus on optimizing the reasoning path. Today's lecture is on 'The Markovian Thinker' from researchers at Mila and Microsoft Research, which takes a different approach. It argues that the bottleneck isn't just the reasoning path itself, but the RL environment we train in. It challenges the assumption that an LLM needs its entire history to think effectively. Yes, Noah?
Noah: Hi Professor. So if it's not about pruning the thought process, what exactly is the bottleneck they're targeting? I thought the main cost was just generating a lot of tokens.
John: That's a good question. The token generation cost is part of it, but the bigger issue is the attention mechanism in transformers. In standard Long Chain-of-Thought RL, or LongCoT-RL, the state is defined as the initial problem plus all the reasoning tokens generated so far. As the chain of thought gets longer, the context grows without bound. For attention-based models, that means the compute grows quadratically with the length of the trace, while the memory for the KV cache grows linearly. If you double the length of your reasoning, you roughly quadruple the attention compute. This creates a hard ceiling on how long a model can 'think'.
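John: To make that scaling concrete, here is a back-of-the-envelope sketch. It is not the paper's cost model; the helper name and the simple pairwise count are purely illustrative, but the growth rate is the point.

```python
# Back-of-the-envelope attention cost for a single LongCoT trace (illustrative).
# Each new token attends to all previous tokens, so generating n tokens costs
# roughly sum_{t=1..n} t ~ n^2 / 2 pairwise attention operations.

def longcot_attention_ops(n_tokens: int) -> int:
    """Approximate pairwise attention operations to generate n_tokens."""
    return n_tokens * (n_tokens + 1) // 2

for n in (24_000, 48_000):
    print(f"{n:>6} tokens -> ~{longcot_attention_ops(n):.2e} attention ops")
# Doubling the reasoning length from 24K to 48K roughly quadruples the cost.
```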
Noah: So that quadratic cost is the real scalability problem.
John: Precisely. The main contribution of this paper is a new paradigm they call 'Markovian Thinking'. The core idea is to decouple the total length of the reasoning process from the size of the context window the model sees at any single step. The policy conditions on a constant-sized state, regardless of how many tokens have been generated overall. This breaks the quadratic scaling problem.
Noah: How can the model reason coherently if it doesn't have the full context? It sounds like it would lose track of its own thoughts.
John: That's the central challenge they address with their specific implementation, an RL environment called 'Delethink'. In Delethink, reasoning is broken into fixed-size chunks, say 8,000 tokens. The model reasons within that chunk as usual. But at the end of the chunk, the context is reset. The full history is discarded.
Noah: Wait, it just throws everything away? How does it continue?
John: It doesn't throw everything away. To maintain continuity, it uses a short textual carryover. The prompt for the next chunk consists of the original problem plus a small, fixed number of tokens from the very end of the previous chunk. This carryover becomes the Markovian state. The reinforcement learning process then incentivizes the model to learn how to make this carryover as dense and informative as possible, essentially teaching it to summarize its progress before starting the next step.
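John: In code, the environment loop is quite simple. This is a minimal sketch of a Delethink-style rollout, not the paper's implementation: the generate(prompt, max_new_tokens) and is_finished helpers are hypothetical stand-ins for your inference stack, and the carryover size here is illustrative.

```python
# Minimal sketch of a Delethink-style rollout (illustrative, not the paper's code).
# Assumes a hypothetical generate(prompt_tokens, max_new_tokens) -> list[int]
# and a hypothetical is_finished(chunk) check for a final-answer marker.

CHUNK_SIZE = 8192   # tokens the model may produce inside one chunk
CARRYOVER = 512     # tokens copied from the tail of the previous chunk (illustrative)
MAX_CHUNKS = 16     # overall thinking budget = MAX_CHUNKS * CHUNK_SIZE

def delethink_rollout(problem_tokens, generate, is_finished):
    carryover = []                      # empty on the first chunk
    trace = []                          # full trace, kept only for logging/reward
    for _ in range(MAX_CHUNKS):
        # The Markovian state: original problem + short textual carryover.
        prompt = problem_tokens + carryover
        chunk = generate(prompt, max_new_tokens=CHUNK_SIZE)
        trace.extend(chunk)
        if is_finished(chunk):          # e.g. the final answer was emitted
            break
        # Context reset: discard everything except the tail of this chunk.
        carryover = chunk[-CARRYOVER:]
    return trace
```

The RL objective then rewards the final answer, so the model is pushed to pack whatever it still needs into that short tail before each reset.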
John: The computational implications are significant. Because the model only ever operates within a fixed-size chunk, the per-step computational cost is constant with respect to the total reasoning length. The overall cost scales linearly with the total thinking length, since you just add more constant-cost chunks, rather than quadratically. Similarly, the memory required for the KV cache stays bounded by the chunk size, which is a major benefit for training and inference on current hardware.
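John: You can see the difference with a few lines of arithmetic. This sketch counts token-to-token attention pairs under the same simplification as before, ignoring the prompt and carryover, so the numbers are illustrative rather than measured.

```python
# Total attention operations, counting token-to-token pairs only (illustrative).

def longcot_ops(total_tokens: int) -> float:
    # Context keeps growing: ~ n^2 / 2.
    return total_tokens ** 2 / 2

def delethink_ops(total_tokens: int, chunk: int = 8192) -> float:
    # Each chunk costs ~ chunk^2 / 2, and the number of chunks grows linearly.
    n_chunks = total_tokens / chunk
    return n_chunks * chunk ** 2 / 2    # = total_tokens * chunk / 2

for n in (24_000, 96_000):
    ratio = longcot_ops(n) / delethink_ops(n)
    print(f"{n} thinking tokens: LongCoT is ~{ratio:.0f}x more attention work")
```

The gap widens the longer the model thinks, which is exactly the regime the paper cares about.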
Noah: So did this actually work in practice? Does sacrificing the full context hurt performance on complex reasoning tasks?
John: That's what their experiments aimed to validate. They used a 1.5 billion parameter model and trained it on a math dataset. They found that their Delethink model, trained with 8K chunks up to a total budget of 24K tokens, performed on par with, and sometimes better than, a standard LongCoT-RL model trained with a full 24K context. This was true for both in-distribution math benchmarks like AIME and out-of-distribution tasks in question answering and code generation.
Noah: Okay, that's interesting. But what's the real advantage if it just matches the performance at the same total token budget?
John: The key advantage appears at test time. The standard LongCoT model's performance plateaus once it hits its 24K token training limit. The Delethink model, however, can continue generating new chunks and its performance keeps improving, even when reasoning for over 100,000 tokens. It learns a scalable, iterative reasoning process. And from a practical standpoint, the training is much cheaper. They estimate that for a very long average thinking length, Delethink reduced the required compute by nearly 75% compared to the LongCoT approach.
Noah: Another question: how did they determine the chunk size? It seems like a critical hyperparameter. If it's too small, the model can't do any meaningful reasoning within one block.
John: They performed an ablation study on that. While the 8K chunk size performed best, models trained with smaller chunks, like 4K or even 2K, still showed improvements over the base model. This suggests the approach is robust and can be adapted to environments with very tight memory constraints, even if there's a performance trade-off.
John: The most significant implication is that it reframes a major scaling challenge. Instead of needing new model architectures, like linear transformers or state-space models, to solve the quadratic cost problem for reasoning, we can redesign the RL environment itself. This work provides strong evidence that for reasoning tasks, a constant-sized state might be sufficient. It suggests that architectures like Mamba, which are inherently more efficient for long sequences, could be particularly well-suited for this kind of reasoning.
Noah: You mentioned the RL process teaches it to create good summaries. Was the model starting from scratch, or does it have some innate ability to do this?
John: That leads to one of their most interesting findings. They tested off-the-shelf LLMs and found they already exhibit what the authors call 'zero-shot Markovian traces'. When they applied the Delethink chunking-and-carryover mechanism to pre-trained models without any RL, the models were often able to continue their reasoning and recover much of their performance. This suggests that LLMs have a latent capability for this kind of compressed, state-based thinking, which provides a very strong starting point for the reinforcement learning fine-tuning.
John: To wrap up, this paper makes a strong case that for complex reasoning, the design of the learning environment is a powerful, and perhaps overlooked, lever for progress. By moving from an ever-growing context to a fixed, Markovian state, they achieve linear scaling in computation and constant memory usage. This allows models to reason for far longer than was previously practical, fundamentally addressing a key bottleneck in the field. The main takeaway is that sometimes, the most effective solution isn't to build a better agent, but to build a better environment for it to learn in.
John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.