Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

BibTeX
@misc{zhang2025agenticcontextengineering,
      title={Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
      author={Qizheng Zhang and Changran Hu and Shubhangi Upasani and Boyuan Ma and Fenglu Hong and Vamsidhar Kamanuru and Jay Rainton and Chen Wu and Mengmeng Ji and Hanchen Li and Urmish Thakker and James Zou and Kunle Olukotun},
      year={2025},
      eprint={2510.04618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.04618},
}
GitHub
ACE-open: https://github.com/sci-m-wang/ACE-open
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Large Language Models. Today's lecture is on 'Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models,' or ACE. We've seen a lot of work recently on improving agentic reasoning, like 'Reflexion,' which focuses on iterative refinement. This paper, from researchers at Stanford and SambaNova Systems, pushes a similar idea but focuses on the context itself as a dynamic, evolving artifact. The core idea is moving away from static prompts. Yes, Noah?

Noah: Hi Professor. When you say 'evolving artifact,' are you suggesting something more structured than the episodic memory we saw in the Reflexion paper?

John: Precisely. Instead of just appending reflections, ACE treats the entire context as a comprehensive playbook to be curated and refined. This matters because it directly addresses key failure modes that emerge in long-running agentic systems, which is where many current methods struggle.

John: The central contribution of ACE is tackling two specific problems the authors identify: 'brevity bias' and 'context collapse.' These are subtle but critical issues.

Noah: Can you clarify 'brevity bias'?

John: Certainly. Many prompt optimization methods implicitly or explicitly aim to create short, concise instructions. The assumption is that shorter is better. However, the ACE authors argue that for complex, domain-specific tasks, LLMs actually benefit from rich, detailed context. This includes specific strategies, examples of past failure modes, or detailed tool-use guidelines. Brevity bias can cause an optimizer to discard this crucial, nuanced information.

Noah: And 'context collapse' is the opposite problem, then?

John: In a way. Context collapse happens when an LLM is asked to iteratively rewrite its own context over time. The model tends to summarize and compress the information, gradually losing the very details that made the context effective in the first place. You end up with a high-level, generic summary, and performance degrades sharply. ACE is designed to prevent this by adding information incrementally rather than rewriting the entire context from scratch.

Noah: So the goal is to create a comprehensive, growing knowledge base rather than a minimal, one-size-fits-all prompt.

John: Exactly. They frame it as an 'evolving playbook.' This is a significant shift from thinking of context as a static, one-shot instruction to thinking of it as a long-term, curated memory that the agent actively maintains and learns from.

John: The methodology to achieve this is a modular, three-part agentic architecture. You have a Generator, a Reflector, and a Curator, each with a specialized role.

Noah: Is this setup similar to actor-critic models we see in reinforcement learning?

John: There are conceptual parallels in the separation of action and evaluation, but the implementation here is quite distinct. The Generator is the 'actor' that executes the task using the current context playbook. The Reflector is the key innovation; it acts as a critic by analyzing the Generator's execution trace, its successes and failures, to distill concrete, actionable insights. It doesn't just say 'that was wrong'; it tries to identify the root cause of an error.

Noah: And this reflection happens without human supervision?

John: Correct. For agent tasks, it primarily relies on execution feedback from the environment, like whether a piece of code ran successfully or an API call returned an error. This is crucial for enabling self-improvement in autonomous systems.
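
A minimal sketch of one Generator-plus-Reflector step as described above. The prompts, the llm callable, and the environment.execute interface are illustrative assumptions, not the paper's actual implementation:

# Hypothetical sketch of one Generator -> Reflector step (Python).
# The prompts and the llm/environment interfaces below are assumptions
# for illustration, not the paper's exact implementation.

def generate_and_reflect(llm, environment, playbook, task):
    # Generator: attempt the task with the current playbook as context.
    trace = llm(
        f"Playbook:\n{playbook}\n\nTask:\n{task}\n\n"
        "Solve the task, showing your reasoning and tool calls."
    )

    # Execution feedback from the environment (e.g. did the code run,
    # did the API call succeed) stands in for human supervision.
    feedback = environment.execute(trace)

    # Reflector: distill concrete, actionable insights from the trace,
    # including the likely root cause of each error.
    lessons = llm(
        "Analyze the attempt below and its outcome. List concrete, reusable "
        "insights: what worked, what failed, and the root cause of each error.\n\n"
        f"Attempt:\n{trace}\n\nOutcome:\n{feedback}"
    )
    return lessons

The lessons returned here are what the Curator consumes in the next part of the transcript.
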
John: After the Reflector distills these lessons, the Curator takes them and integrates them into the playbook.

Noah: Hold on, how does it avoid context collapse if it just keeps adding things? Wouldn't the context window fill up eventually?

John: An important question. This is where two other key mechanisms come into play. First, it uses 'incremental delta updates.' Instead of rewriting the entire context, the Curator adds or modifies small, itemized bullets of information. This is computationally much cheaper and preserves existing knowledge. Second, it employs a 'grow-and-refine' mechanism. The playbook grows with new insights, but it is also periodically refined. This refinement step uses semantic embeddings to find and prune redundant or outdated bullets, keeping the playbook relevant and manageable without sacrificing critical detail.

John: The empirical results are quite compelling. On the AppWorld benchmark, which tests agentic tool use, ACE enabled a smaller, open-source model to match and even surpass a top-ranked production-level agent powered by GPT-4.1. This suggests that sophisticated context engineering can be a viable alternative to simply scaling up model size for certain tasks.

Noah: So this approach could make high-performance agents more accessible by improving smaller models?

John: That is a significant implication. It could help democratize high-end performance. Furthermore, the efficiency gains are substantial. Compared to a baseline method like GEPA, they report over an 80 percent reduction in adaptation latency and a 75 percent reduction in the number of environment rollouts needed. This makes online, continuous learning much more practical for real-world deployment where cost and speed are constraints.

Noah: The paper also mentions building on 'Dynamic Cheatsheet.' How does ACE extend that work?

John: It directly addresses the limitations of that earlier framework. Dynamic Cheatsheet introduced the core idea of an adaptive external memory. ACE formalizes and improves the process with the dedicated Reflector and Curator roles, and it specifically designs the delta updates and refinement steps to combat the context collapse that could still occur in the original Dynamic Cheatsheet model. It's a more robust and structured implementation of that initial concept.

John: So, the key takeaway is that by treating context not as a static prompt but as a structured, evolving playbook, we can create more robust, efficient, and self-improving LLM agents. The ACE framework provides a concrete methodology for doing this, addressing critical failure modes like brevity bias and context collapse that have limited previous approaches.

John: This shift towards dynamic, rich context management is a promising direction for building more capable and autonomous AI systems. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
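
A minimal sketch of the Curator's incremental delta update and the grow-and-refine pruning John describes above, assuming a sentence-transformers embedding model and an arbitrary similarity threshold; the paper's actual curation and deduplication logic may differ:

# Hypothetical sketch of the Curator's grow-and-refine step (Python).
# The embedding model, the 0.9 similarity threshold, and the greedy
# deduplication policy are assumptions, not the paper's implementation.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def curate(playbook_bullets, new_bullets, sim_threshold=0.9):
    # Grow: append the Reflector's new insight bullets as a delta update
    # instead of rewriting the whole playbook.
    bullets = playbook_bullets + new_bullets
    if not bullets:
        return bullets

    # Refine: embed every bullet and keep only those that are not
    # near-duplicates of an earlier, already-kept bullet.
    similarities = cosine_similarity(embedder.encode(bullets))
    kept = []
    for i in range(len(bullets)):
        if all(similarities[i][j] < sim_threshold for j in kept):
            kept.append(i)
    return [bullets[i] for i in kept]

Greedy deduplication against already-kept bullets is one simple way to keep the playbook compact while preserving distinct, detailed entries; per the transcript, this refinement runs periodically rather than on every update.
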