Build AI Agents You Can Trust in Production.
Learn how to evaluate, test, and improve AI agents before deployment.
A hands-on bootcamp focused on the most critical skill in agent engineering: evaluation.
Building AI agents is easy.
Knowing whether they actually work is hard.
Most agent failures aren't caused by the model but they're caused by poor evaluation.
Teams often discover problems only after deployment:
- Agents that work in demos but fail in production
- Tool-calling workflows that silently break
- Prompt or model updates that introduce regressions
- Hallucinated actions and unreliable decisions
- Metrics that don't reflect real user outcomes
In this practical bootcamp, you'll learn the frameworks, metrics, and testing workflows used by leading AI teams to measure, diagnose, and improve AI agents with confidence.
You'll Learn How To:
- Measure agent performance beyond benchmark scores
- Evaluate reasoning, planning, and tool use
- Identify and diagnose agent failures
- Build realistic evaluation datasets
- Detect regressions before deployment
- Automate evaluations and use AI judges effectively
- Design a repeatable evaluation system for your agents
By the end of the workshop, you'll have a production-ready framework for evaluating and improving AI agents at scale.
What's Included
- 🎥 Live sessions + Access to all recordings and materials
- ✅ A practical evaluation framework you can apply immediately to your own systems
- 🤖 6 months of access to our AI Evals assistant.
- 🛠️ Hands-on exercises and implementation templates
- 📚 Earn a shareable certificate recognizing your participation and practical understanding of AI agent evaluation workflows.
- 🤝 Access to a community of AI practitioners
📅 June 27, 2026 ⏱ 5 hours
👨💻 Led by Ammar Mohanna, PhD
🏆 Packt Publishing Endorsed Certification
Instructor:
Ammar Mohanna is an AI engineer, researcher, and educator focused on building and evaluating large language model (LLM) systems for real-world production use. He holds a PhD in Edge Artificial Intelligence and has extensive experience designing, deploying, and assessing AI systems across enterprise, education, and applied machine learning domains. His work centers on evaluation frameworks, system reliability, and operational readiness for generative and multimodal AI systems. Ammar is also an adjunct professor and course designer, teaching machine learning, AI engineering, and generative AI to academic and professional audiences. Through his industry consulting and teaching, he helps teams move LLM-powered systems from prototypes to reliable, production-grade solutions.
Course Objectives:
- Explain where modern agents fail across tool calls, planning steps, trajectories, outputs, and adversarial inputs.
- Build component-level evaluations for tool selection, argument quality, and planning quality.
- Define and measure trajectory-level metrics including step count, cost, recovery, and loop detection.
- Build outcome-level evaluators using multi-dimensional rubrics and LLM-as-judge, then calibrate the judge against human labels.
- Identify adversarial failure modes unique to agents, especially indirect prompt injection through tool outputs.
- Decide which evaluation layer matters most for a specific use case and team maturity.
Why Agent Evaluation Matters
Building AI agents is easier than ever.
Evaluating whether they actually work is the real challenge.
Traditional software testing wasn't designed for systems that reason, plan, use tools, and make autonomous decisions.
Without robust evaluation systems, teams struggle to answer critical questions:
- Is the agent actually improving?
- Which changes introduce regressions?
- How reliable is the system in production?
- What should we optimize next?
The most successful AI teams don't just build agents—they build evaluation systems around them.
This bootcamp shows you exactly how.
The Production Agent Evaluation Framework
Throughout the bootcamp, you'll learn a practical framework for evaluating AI agents across four critical layers:
- Layer 1: Component Evaluation: Evaluate prompts, retrieval systems, tools, and model outputs independently.
- Layer 2: Trajectory Evaluation: Analyze how agents reason, plan, and make decisions throughout task execution.
- Layer 3: Outcome Evaluation: Measure whether the agent successfully achieved the user's objective.
- Layer 4: Adversarial Evaluation: Stress-test agents against edge cases, unexpected behaviors, attacks, and failures.
Most teams focus only on prompts and outputs.
Production teams evaluate all four layers.
What You'll Build
This isn't a theory-heavy workshop.
You'll actively work through real-world evaluation scenarios and build systems you can immediately apply in your own projects.
During the bootcamp you'll:
- Design evaluation strategies for AI agents
- Create high-quality evaluation datasets
- Evaluate agent trajectories and reasoning paths
- Analyze real agent failures
- Develop evaluation metrics that align with business outcomes
- Create adversarial test suites
- Build automated evaluation pipelines
- Implement AI judges and LLM-as-a-Judge workflows
- Design regression testing systems for continuous improvement
Capstone Project
Throughout the bootcamp you'll work on a practical evaluation project that brings together everything you learn.
You'll:
- Define success criteria for an AI agent
- Build an evaluation dataset
- Create meaningful evaluation metrics
- Analyze agent trajectories
- Identify failure patterns
- Design adversarial tests
- Build a repeatable evaluation workflow
You'll leave with a complete evaluation framework that can be adapted to your own agents and production systems.
Target Audience and Prerequisites:
Audience: AI/ML Engineers, Software Engineers, Data Scientists, Product Managers, and others working with or evaluating LLM-powered agents.
Prerequisites: Working knowledge of Python, familiarity with calling LLM APIs, conceptual understanding of LLMs, and a laptop capable of running Jupyter notebooks. Prior agent experience is helpful but not required.
Why Attend This Workshop
The biggest challenge in AI today isn't building agents.
It's knowing whether they're actually working.
The organizations that succeed with AI won't be the ones that build the most agents.
They'll be the ones that can reliably evaluate, improve, and trust the agents they deploy.
If you're building AI agents today or planning to, you need evaluation skills in your toolkit.
Join us and learn the practical evaluation frameworks used to build AI agents that can be trusted in production.
Lineup
Ammar Mohanna, PhD
Good to know
Highlights
- 4 hours
- Online
Refund Policy
Location
Online event
Agenda
-
Block 1 | Component-Level Evaluation + Short Agent Failure Refresher
Learn what makes AI agents fundamentally different from chat applications and why traditional evaluation methods often fail for agentic systems. This session covers common agent failure modes including incorrect tool usage, flawed arguments, looping behavior, hallucinated observations, and weak synthesis. You’ll explore how to evaluate tool selection, argument quality, planning behavior, and multi-step reasoning using practical component-level metrics and labeled datasets. The lecture also introduces common evaluation pitfalls such as label leakage and over-specified trajectories. Hands-On: Run a research assistant agent on a pre-built evaluation dataset, measure tool-selection accuracy and argument-match quality, and classify real agent failures using a practical evaluation taxonomy.
-
Block 2 | Trajectory Evaluation
Learn why evaluating agent trajectories matters just as much as evaluating final outputs. This session explores how reliable agents differ from inefficient or unstable ones, even when they produce the same answer. You’ll learn key trajectory evaluation concepts including golden vs. acceptable paths, loop detection, redundant tool calls, recovery behavior, latency, token cost, and trace observability. We’ll also cover how to identify failure patterns such as repeated calls, missing recovery, and unnecessary detours. In the hands-on exercise, participants will analyze 20 pre-recorded agent traces and write trajectory-level assertions to detect loops, excessive steps, duplicate tool calls, and failed recovery behavior.
-
Block 3 | Outcome Evaluation, LLM-as-Judge, and Regression
Learn how modern teams evaluate agent outputs beyond simple correctness metrics. This session covers factuality, completeness, safety, groundedness, and designing multi-dimensional evaluation rubrics for AI agents. Explore reference-based vs. reference-free judging, pairwise vs. rubric-style evaluation, and how to calibrate LLM-as-judge systems against human labels using correlation and agreement metrics. You’ll also examine common judge failure modes such as verbosity bias and self-preference bias, along with practical mitigation strategies. Finally, understand how evaluation evolves into production regression testing using stable benchmarks and repeated deployment checks. Hands-On: Run a pre-built LLM-as-judge over agent outputs, compare results against human labels, compute agreement metrics, and identify weak evaluation dimensions.