Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
BibTeX
@misc{xia2025agent0unleashingselfevolving,
title={Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning},
author={Peng Xia and Kaide Zeng and Jiaqi Liu and Can Qin and Fang Wu and Yiyang Zhou and Caiming Xiong and Huaxiu Yao},
year={2025},
eprint={2511.16043},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.16043},
}
Transcript
John: Welcome to Advanced Topics in Autonomous Systems. Today's lecture is on a paper from researchers at UNC-Chapel Hill, Salesforce, and Stanford titled 'Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning.' The field has been dominated by reinforcement learning approaches like RLHF, but they all depend on massive, human-curated datasets. This creates a significant bottleneck. This paper proposes a way to sidestep that entirely. Yes, Noah?
Noah: Hi Professor. How does this differ from other self-evolution or self-play frameworks? Aren't they also trying to move away from human data?
John: An excellent question. Many existing self-evolution frameworks hit a capability ceiling. The model generates tasks based on its current knowledge, so the tasks rarely exceed its own complexity, leading to learning stagnation. Agent0 tries to solve this by creating a dynamic where two agents push each other to improve in a structured, competitive way.
Noah: Two agents?
John: Correct. The core of the methodology is a co-evolutionary loop between two agents initialized from the same base model. First, you have the Curriculum Agent, whose job is to generate challenging problems. Second, you have the Executor Agent, whose job is to solve those problems. They operate in an iterative cycle. In each iteration, the Curriculum Agent is trained with reinforcement learning to generate tasks that are just at the edge of the current Executor Agent's ability. Then, that new, harder curriculum is used to train and improve the Executor Agent. This creates a feedback loop where the curriculum gets harder as the solver gets smarter.
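John: To make that concrete, here is a minimal Python-style sketch of one co-evolution iteration. Keep in mind this is my own illustration rather than the authors' code; train_curriculum, generate_tasks, and train_executor are placeholder names for the steps I just described.

# Illustrative sketch of the Agent0 co-evolutionary loop (not the paper's actual code).
# Both agents are initialized from the same base model; each iteration first updates
# the curriculum agent, then trains the executor on the new, harder tasks.
def co_evolve(base_model, num_iterations):
    curriculum_agent = base_model.copy()
    executor_agent = base_model.copy()
    for _ in range(num_iterations):
        # 1. RL-train the curriculum agent to propose tasks at the edge of the
        #    current executor's ability (its reward is sketched later in the lecture).
        curriculum_agent = train_curriculum(curriculum_agent, executor_agent)
        # 2. Sample a fresh batch of frontier tasks from the updated curriculum.
        tasks = generate_tasks(curriculum_agent)
        # 3. RL-train the executor on those tasks using self-generated pseudo-labels.
        executor_agent = train_executor(executor_agent, tasks)
    return executor_agent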
Noah: So what keeps the Curriculum Agent from just generating nonsense or a million variations of the same hard problem?
John: That's the critical part of its design. The Curriculum Agent isn't just trying to make hard problems; it's optimizing a specific, composite reward signal. This signal has three key components. First, an 'uncertainty reward' which encourages tasks where the Executor Agent shows low confidence—meaning the problem is not too easy and not impossibly hard. Second, a 'repetition penalty' that discourages generating tasks similar to others in the same batch, ensuring diversity. And third, and most importantly, a 'tool use reward.' This explicitly incentivizes the generation of tasks that require the Executor to use an external tool, like a code interpreter.
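John: Schematically, and with the caveat that the precise terms and weights are defined in the paper rather than here, you can picture the curriculum reward like this in Python; self_consistency, similarity, and uses_tool are hypothetical helpers standing in for the three components.

# Schematic composite reward for the curriculum agent (illustration only;
# the paper defines the exact formulas and weights).
def curriculum_reward(task, executor_responses, batch_tasks,
                      w_unc=1.0, w_rep=1.0, w_tool=1.0):
    # Uncertainty reward: largest when the executor is neither always right nor
    # always wrong on the task, i.e. its self-consistency is middling.
    p = self_consistency(executor_responses)   # fraction agreeing with the majority answer
    uncertainty = 1.0 - abs(2.0 * p - 1.0)     # peaks at p = 0.5
    # Repetition penalty: discourage tasks too similar to others in the same batch.
    repetition = max((similarity(task, other) for other in batch_tasks if other is not task),
                     default=0.0)
    # Tool-use reward: bonus when solving the task required calling the code interpreter.
    tool_bonus = 1.0 if uses_tool(executor_responses) else 0.0
    return w_unc * uncertainty - w_rep * repetition + w_tool * tool_bonus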
Noah: Wait, so the curriculum itself learns to integrate tools? That seems to be the key difference.
John: Exactly. This tool integration is what prevents the stagnation we talked about. By rewarding tool use, the curriculum forces the Executor to solve problems that may be beyond its innate reasoning capacity. As the Executor gets better at using tools, the Curriculum Agent must then generate even more complex, tool-reliant tasks to remain challenging. This creates a 'virtuous cycle' that drives both agents' capabilities upward simultaneously. The paper shows this clearly: over iterations, the generated tasks required progressively more tool calls.
Noah: Okay, that makes sense for the curriculum. But what about the Executor? How do you train it with RL when the 'correct' answer is also generated by a model and might be noisy or outright wrong?
John: That's the other main technical contribution. They introduce a method called Ambiguity-Dynamic Policy Optimization, or ADPO. First, for each problem, they generate multiple solutions and use majority voting to create a pseudo-label. They then calculate the model's self-consistency, which is the proportion of responses that agree with that majority answer. ADPO uses this self-consistency score to dynamically adjust the training process. For tasks with low agreement, where the pseudo-label is unreliable, it scales down the learning signal to prevent overfitting to a potentially incorrect answer. It also relaxes the training constraints for these ambiguous tasks, allowing for more exploration of alternative reasoning paths. This makes the training process robust to the inherent noise of self-generated data.
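John: Here is a rough Python illustration of that idea, again not the authors' exact objective: the pseudo-label comes from majority voting, and the self-consistency score both scales the learning signal and, in this sketch, relaxes a PPO-style clipping range for ambiguous tasks.

from collections import Counter

# Illustrative ADPO-style weighting (not the paper's exact objective). For each
# self-generated task we build a pseudo-label by majority vote, measure
# self-consistency, and use it to (a) scale the advantage and (b) relax the
# clipping constraint when the pseudo-label looks unreliable.
def pseudo_label_and_consistency(final_answers):
    counts = Counter(final_answers)
    label, votes = counts.most_common(1)[0]
    consistency = votes / len(final_answers)   # fraction agreeing with the majority
    return label, consistency

def adpo_weights(consistency, base_clip=0.2, max_extra_clip=0.2):
    # Ambiguous tasks (low agreement) get a down-weighted learning signal so the
    # executor does not overfit to a possibly wrong pseudo-label...
    advantage_scale = consistency
    # ...and a relaxed trust region, allowing more exploration of alternative paths.
    clip_range = base_clip + (1.0 - consistency) * max_extra_clip
    return advantage_scale, clip_range

# Example: 8 sampled solutions, 5 of which agree on the final answer "42".
label, c = pseudo_label_and_consistency(["42", "42", "41", "42", "42", "40", "42", "39"])
scale, clip = adpo_weights(c)   # scale = 0.625, clip widened to 0.275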
Noah: So it basically tells the model: 'don't be too confident in this update if the data itself is messy.' That's clever.
John: Precisely. And the results validate this approach. The method shows substantial improvements in mathematical and general reasoning over the base models, outperforming other self-improvement methods like R-Zero and even Socratic-Zero, which relies on external OpenAI APIs for assistance. Agent0 is fully self-contained.
Noah: What are the broader implications, then? Does this mean we can just take a base model and let it evolve into a super-reasoner for any domain without needing human experts?
John: That's the long-term vision this work points toward. The most significant impact is breaking the dependency on human-curated data for specialized agent training. This could dramatically reduce the cost and time needed to develop highly capable agents. Instead of meticulously crafting datasets, you design the evolutionary environment and the reward signals. It shifts the paradigm from data curation to mechanism design. It’s a step towards more autonomous AI that can continuously improve and discover new problem-solving strategies in complex domains, especially those that benefit from verifiable tools like code execution or mathematical provers.
John: To wrap up, the key takeaway from Agent0 is its blueprint for creating a self-sustaining learning ecosystem. By using a co-evolutionary loop and explicitly rewarding tool integration in the curriculum itself, the framework creates a virtuous cycle that pushes past the stagnation limits of previous self-play methods. It's a compelling demonstration of how agents can teach themselves to become better reasoners entirely from scratch. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.