In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

BibTeX
@misc{li2025intheflowagenticsystem,
      title={In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
      author={Zhuofeng Li and Haoxiang Zhang and Seungju Han and Sheng Liu and Jianwen Xie and Yu Zhang and Yejin Choi and James Zou and Pan Lu},
      year={2025},
      eprint={2510.05592},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.05592},
}
GitHub: lupantech/AgentFlow (463 stars)
https://github.com/lupantech/AgentFlow
AI Audio Lecture + Q&A
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Transcript
John: Alright, welcome to Deep Reinforcement Learning Methods. Today's lecture is on a recent paper from researchers at Stanford titled 'In-the-Flow Agentic System Optimization for Effective Planning and Tool Use'. We've seen a lot of work on agentic systems, with frameworks like 'Reflexion' exploring verbal reinforcement and 'ARTIST' unifying tool use with RL. This paper, however, tries to solve the problem of training modular agents directly within their operational loop, moving beyond static or offline approaches.

John: Yes, Noah?

Noah: Hi Professor. You mentioned 'in-the-flow' optimization. Does that just mean on-policy learning, or is there something more specific to how they're defining it here?

John: That's a great question. It is on-policy, but the 'in-the-flow' concept specifically refers to training the agent's planner module within the live, multi-turn dynamics of the entire agentic system. The policy learns from rollouts generated by the full, interacting system of modules, not in a simplified or offline environment. This is the central challenge they're addressing.

John: So, the main idea is to build a trainable agentic system called AgentFlow. Instead of a single, monolithic model that does everything, they decompose the task across four specialized modules. There's an Action Planner, which is the trainable policy that decides what to do next. Then a Tool Executor calls the chosen tool, like a web search or code interpreter. A Verifier checks whether the task is complete, and finally, a Solution Generator produces the final answer. These modules communicate through a shared, evolving memory.
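To make that division of labor concrete, here is a minimal sketch of the kind of orchestration loop the lecture describes. The class and method names (Memory, plan, execute, is_complete, generate) are illustrative assumptions, not the actual AgentFlow API; in the real system the modules are LLM-driven and coordinate through the evolving memory described above.

```python
# Illustrative sketch only -- hypothetical names, not the AgentFlow codebase.
# A trainable Planner proposes an action, a Tool Executor runs it, results
# accumulate in a shared memory, a Verifier decides whether to stop, and a
# Solution Generator writes the final answer from that memory.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Shared, evolving record of the rollout that every module can read."""
    query: str
    steps: list = field(default_factory=list)   # (action, observation) pairs

def run_agent_loop(query, planner, executor, verifier, generator, max_turns=10):
    memory = Memory(query=query)
    trajectory = []                              # planner actions, kept for RL later
    for _ in range(max_turns):
        action = planner.plan(memory)            # trainable policy: pick tool + sub-goal
        observation = executor.execute(action)   # e.g. web search or code interpreter
        memory.steps.append((action, observation))
        trajectory.append(action)
        if verifier.is_complete(memory):         # enough evidence to answer?
            break
    answer = generator.generate(memory)          # final response from shared memory
    return answer, trajectory
```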
Noah: This modular design sounds familiar. What prevents it from being just another complex, handcrafted system that's hard to optimize?

John: The key contribution is precisely in making it optimizable. While past agentic systems were often static and relied on prompting, AgentFlow makes the planner a trainable policy. The core problem this creates, which you're all familiar with, is credit assignment. If a task takes ten steps and only succeeds or fails at the very end, how do you know which of the ten actions was good or bad? The reward is extremely sparse.

Noah: Right, so how do they solve that? A single binary reward seems like a very weak signal for a long trajectory.

John: Exactly. This is where their optimization algorithm, which they call Flow-based Group Refined Policy Optimization, or Flow-GRPO, comes in. It's an interesting take on policy gradient methods. First, they run the full AgentFlow system to generate a trajectory. Then they evaluate the final outcome with an LLM-as-judge to get a single, trajectory-level reward: one for success, zero for failure.

Noah: But how does that one reward value propagate back to earlier actions?

John: Here's the crucial step. They broadcast that single trajectory-level reward to every action taken within the trajectory. In effect, this converts the complex multi-turn problem into a series of simpler, single-turn policy updates. Every action in a successful trajectory is treated as 'good,' and every action in a failed one is treated as 'bad.' This simplifies credit assignment dramatically.

Noah: Hold on, that seems like it would introduce a lot of noise. An early step could have been perfect, but a late mistake fails the whole trajectory, and that good early step still gets a reward of zero. Isn't that a problem?

John: It would be, but they add a stabilization mechanism. Flow-GRPO runs a group of rollouts in parallel. The advantage for any given action is calculated not in isolation, but by normalizing its trajectory's reward against the mean and standard deviation of the entire group's rewards. This group normalization sharpens the signal: a trajectory that succeeds when most others fail gets a high positive advantage, strongly reinforcing its actions. This is different from methods like 'Group-in-Group Policy Optimization', which aims for more fine-grained, hierarchical credit assignment; Flow-GRPO keeps it simpler, at the trajectory level.

John: The results of this approach are quite significant. The paper shows that their 7-billion-parameter AgentFlow model consistently outperforms other specialized models and even much larger proprietary models like GPT-4o across a range of reasoning tasks. For example, it showed an average accuracy gain of nearly 15 percent over the best 7B baseline on search-intensive tasks. This suggests that a well-designed, trainable architecture can be more effective and efficient than simply scaling up a monolithic model.

Noah: What about the ablation studies? Did they prove that the 'in-the-flow' RL part was necessary? What if they just used a frozen GPT-4o as the planner module?

John: They did, and it's one of the most compelling parts of the paper. Using a frozen GPT-4o as the planner gave only a small performance boost. More tellingly, when they tried offline supervised fine-tuning on successful trajectories generated by GPT-4o, performance collapsed catastrophically. This underscores that simply imitating good actions token by token is not enough; the agent needs to learn from the dynamic feedback and sparse success signals of the live environment, which is what Flow-GRPO enables.

Noah: So the implication is that the field's focus on just scaling model size might be incomplete?

John: Precisely. It suggests that system architecture and learning paradigm are critical variables. This work provides a strong argument that combining modular, agentic designs with robust on-policy RL is a highly effective path forward. It's a different direction from, say, 'Agentic Continual Pre-training', which aims to embed agentic skills into the foundation model itself. AgentFlow keeps the modules separate but makes their coordination learnable.

John: To wrap up, AgentFlow presents a trainable agentic framework that effectively tackles the long-horizon credit assignment problem with a simple yet robust RL algorithm. It demonstrates that a smaller, well-trained modular system can outperform larger, more general models on complex tool-use and planning tasks. The main takeaway is that enabling agents to learn directly 'in the flow' of interaction is a critical step toward building more adaptive and capable AI systems.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
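To make the credit-assignment scheme from the lecture concrete, here is a minimal sketch of the broadcast-and-normalize step: each rollout's single outcome reward is shared by every action in that rollout, and advantages are computed relative to the group's mean and standard deviation. The function name, input format, and epsilon term are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of group-relative advantage computation (illustration only).
# Each element of `rewards` is the trajectory-level outcome (1.0 success /
# 0.0 failure) for one rollout in a group generated from the same query.
import math

def group_normalized_advantages(rewards, actions_per_rollout, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = []
    for r, n_actions in zip(rewards, actions_per_rollout):
        a = (r - mean) / (std + eps)        # group-normalized trajectory advantage
        advantages.append([a] * n_actions)  # broadcast to every action in the rollout
    return advantages

# Example: 4 rollouts for one query; only the first succeeded, so all of its
# actions get a strong positive advantage and the failures get negative ones.
print(group_normalized_advantages([1.0, 0.0, 0.0, 0.0], [3, 5, 4, 6]))
```

In a full training loop, these per-action advantages would weight a clipped policy-gradient update of the planner, so actions from a rollout that succeeds while most of its group fails are reinforced strongly.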