A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows

BibTeX
@misc{bandara2025practicalguidedesigning,
      title={A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows},
      author={Eranga Bandara and Ross Gore and Peter Foytik and Sachin Shetty and Ravi Mukkamala and Abdul Rahman and Xueping Liang and Safdar H. Bouk and Amin Hass and Sachini Rajapakse and Ng Wee Keong and Kasun De Zoysa and Aruna Withanage and Nilaan Loganathan},
      year={2025},
      eprint={2512.08769},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.08769},
}
AI Audio Lecture + Q&A
Transcript
John: In our course on Advanced MLOps and AI Systems Engineering, we often discuss the last-mile problem of deployment. Today's lecture is on 'A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows' by a team from Old Dominion University and Deloitte. We've seen a surge of papers that survey agent capabilities, like the 'Survey on Evaluation of LLM-based Agents', or define taxonomies, but this work pivots. It argues that the main challenge is no longer just building a clever prototype, but engineering a reliable system. It's about moving from the lab to production.

John: Yes, Noah?

Noah: Hi Professor. So, is this paper less about a new AI architecture and more about applying established software engineering principles to this new domain of agentic workflows?

John: Precisely. The authors' core objective is to bridge that prototype-to-production gap. They aren't proposing a new type of agent; they're providing an end-to-end framework to prevent the creation of brittle, inconsistent, and high-maintenance agentic systems, which is a common outcome of ad-hoc development. Their motivation is to establish a disciplined engineering lifecycle. They formalize this into nine core best practices, designed to enhance determinism, observability, and maintainability: qualities essential for any production-grade software, but often overlooked in the rush to innovate with AI.

Noah: Could you give an example of one of those best practices?

John: Certainly. A key one is designing 'Single-Responsibility Agents'. Instead of creating one monolithic agent that can scrape a website, summarize text, and publish the result, they advocate for breaking that down. You'd have one agent that only scrapes, another that only summarizes, and so on. This simplifies the prompting for each agent, makes them easier to test and debug, and drastically reduces non-deterministic failures where an agent might miss a step or perform tasks out of order. It's a classic engineering principle applied to a new context.

Noah: That makes sense. It sounds like modularity for LLMs. How did they demonstrate these principles in practice?

John: They built a comprehensive case study: an automated multimodal podcast generation workflow. This system autonomously finds news articles, uses multiple LLMs to generate a script, synthesizes that script to resolve inconsistencies, produces audio and video files, and publishes everything to a GitHub repository. It's a complex, multi-step process that would be highly prone to failure without a robust architecture. Two of their most critical insights are showcased here. The first is their approach to Responsible AI, which they call a 'multi-model consortium'.

Noah: A multi-model consortium? How does that work?

John: Instead of relying on a single model like GPT-4, they have several heterogeneous agents (say, one using Llama, one using Gemini, and one using GPT) independently generate a draft of the podcast script from the same source material. Then, a separate 'reasoning agent' analyzes these drafts. Its job is to cross-validate facts, identify and discard hallucinations or biased statements that appear in only one output, and synthesize a single, more reliable script. It's a consensus-based approach to improve factual accuracy and alignment.
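Editor's note: the paper is not quoted here, but the two ideas John just described can be made concrete with a minimal Python sketch. The two narrow agent classes illustrate the single-responsibility split; the loop over heterogeneous drafters plus a separate reasoning agent illustrates the consortium. The complete() helper, the model names, and the prompts are placeholders standing in for whatever LLM client and models a real workflow would use.

from dataclasses import dataclass

def complete(model: str, prompt: str) -> str:
    """Hypothetical LLM call; wire this to your actual provider client."""
    raise NotImplementedError

@dataclass
class DraftAgent:
    """Single-responsibility agent: turn source articles into one draft script."""
    model: str

    def draft(self, articles: str) -> str:
        prompt = f"Write a podcast script from these source articles:\n{articles}"
        return complete(self.model, prompt)

@dataclass
class ReasoningAgent:
    """Single-responsibility agent: cross-validate drafts, produce one script."""
    model: str

    def synthesize(self, articles: str, drafts: list[str]) -> str:
        joined = "\n\n---\n\n".join(drafts)
        prompt = (
            "Several scripts were written independently from the same sources. "
            "Discard claims that appear in only one draft and conflict with the "
            "sources, then merge the rest into a single consistent script.\n\n"
            f"Sources:\n{articles}\n\nDrafts:\n{joined}"
        )
        return complete(self.model, prompt)

def consortium_script(articles: str) -> str:
    # Heterogeneous consortium: each drafter works independently on the same input.
    drafters = [DraftAgent("llama"), DraftAgent("gemini"), DraftAgent("gpt-4")]
    drafts = [agent.draft(articles) for agent in drafters]
    # A separate reasoning agent resolves disagreements into the final script.
    return ReasoningAgent("gpt-4").synthesize(articles, drafts)

Note that the orchestration (which agents run, and in what order) lives in plain code rather than inside a model's reasoning, which is also what keeps the overall structure flat and debuggable.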
Noah: But doesn't running three or four LLMs in parallel for one task dramatically increase the cost and latency? That seems counterintuitive for a 'production-grade' system.

John: That's an excellent point. There is an explicit trade-off. They are prioritizing correctness, reliability, and safety over raw cost and speed. For high-stakes applications like generating news content or compliance reports, the cost of a single hallucination or factual error can be far greater than the computational overhead. Their second key insight addresses efficiency elsewhere. For tasks that don't require LLM reasoning, like making a direct API call to publish a file to GitHub, they advocate for using direct function calls within the orchestration code rather than having an agent decide to use a 'publish' tool. This is faster, cheaper, and fully deterministic.

Noah: So the takeaway is that the bottleneck for agentic AI is shifting from model capability to the engineering and orchestration frameworks around them?

John: Exactly. This work suggests that the next wave of progress will come from disciplined system design, not just bigger models. It reframes the problem. Instead of asking 'how can we make an agent smarter?', it asks 'how can we build a reliable system out of potentially unreliable agents?'. This connects directly to the broader MLOps landscape, extending mature principles like containerization with Docker and orchestration with Kubernetes to these new, complex AI workflows. It professionalizes the field.

Noah: Wait, one of their final principles is 'Keep it Simple, Stupid'. How does that align with a multi-model consortium and all this complex orchestration?

John: A valid question. Their definition of 'simple' here refers to structural simplicity and transparent logic. They advocate for flat, rather than deeply nested, agent hierarchies. The orchestration logic is explicit in the code, not hidden inside an LLM's reasoning process. While the system has many components, the interactions between them are clearly defined and observable. Simplicity for them means debuggability and predictability, even if the overall component count is higher.

John: To wrap up, this paper provides a much-needed engineering blueprint for agentic AI. It shifts the conversation from pure capability to production readiness, focusing on reliability, maintainability, and responsibility. The main takeaway is that for agentic AI to succeed in the real world, we must adopt a disciplined, engineering-first mindset. The challenge isn't just making agents smart; it's making the systems they inhabit dependable.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
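Editor's note: as a footnote to John's point about deterministic steps, here is a minimal sketch, again not taken from the paper, of a publish step written as a direct function call in the orchestration code rather than as an LLM-selected tool. The directory layout, file names, and commit message are illustrative placeholders; the only external dependency is the git command line.

import subprocess
from pathlib import Path

def publish_episode(repo_dir: Path, audio_file: Path, script_file: Path) -> None:
    """Deterministic publish step: copy finished artifacts into the repo and push.

    No agent decides whether or how to publish; the workflow code just does it.
    """
    episodes = repo_dir / "episodes"
    episodes.mkdir(exist_ok=True)
    for artifact in (audio_file, script_file):
        (episodes / artifact.name).write_bytes(artifact.read_bytes())
    subprocess.run(["git", "-C", str(repo_dir), "add", "episodes"], check=True)
    subprocess.run(
        ["git", "-C", str(repo_dir), "commit", "-m", "Add new podcast episode"],
        check=True,
    )
    subprocess.run(["git", "-C", str(repo_dir), "push"], check=True)

Because nothing in this step depends on model output, it is cheap, repeatable, and easy to test, which is exactly the property the lecture highlights.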