Build AI Agents You Can Trust in Production. Master Evaluation, Reliability, and Adversarial Testing for Modern Agentic Systems.
AI agents are moving into production faster than most teams know how to evaluate them. Unlike chatbots, agents plan, call tools, recover from failures, and operate across complex multi-step trajectories, where a simple “correct answer” metric misses critical reliability and safety risks.
In this live, 4-hour hands-on workshop, you’ll learn how modern AI teams evaluate agents across four essential layers: component behavior, trajectories, outcomes, and adversarial robustness. Through guided notebooks and practical exercises, you’ll test how agents make decisions, handle failures, produce outputs, and respond under attack scenarios.
By the end of the session, you’ll leave with reusable evaluation workflows, a practical production-readiness framework, and the confidence to assess whether your own AI agents are truly ready for deployment.
📅 June 27, 2026 ⏱ 4 hours
👨‍💻 Led by Ammar Mohanna, PhD
🏆 Packt Publishing Endorsed Certification
Instructor:
Ammar Mohanna is an AI engineer, researcher, and educator focused on building and evaluating large language model (LLM) systems for real-world production use. He holds a PhD in Edge Artificial Intelligence and has extensive experience designing, deploying, and assessing AI systems across enterprise, education, and applied machine learning domains. His work centers on evaluation frameworks, system reliability, and operational readiness for generative and multimodal AI systems. Ammar is also an adjunct professor and course designer, teaching machine learning, AI engineering, and generative AI to academic and professional audiences. Through his industry consulting and teaching, he helps teams move LLM-powered systems from prototypes to reliable, production-grade solutions.
Course Objectives:
- Explain where modern agents fail across tool calls, planning steps, trajectories, outputs, and adversarial inputs.
- Build component-level evaluations for tool selection, argument quality, and planning quality.
- Define and measure trajectory-level metrics including step count, cost, recovery, and loop detection.
- Build outcome-level evaluators using multi-dimensional rubrics and LLM-as-judge, then calibrate the judge against human labels.
- Identify adversarial failure modes unique to agents, especially indirect prompt injection through tool outputs.
- Decide which evaluation layer matters most for a specific use case and team maturity.
Why this Workshop:
- Live instructor-led workshop: Learn directly from an expert through a highly interactive, real-time learning experience focused on practical agent evaluation techniques.
- Hands-on guided notebooks: Follow step-by-step exercises using pre-built notebooks designed to help you apply evaluation concepts immediately.
- Reusable agent evaluation workflows: Get practical evaluation patterns and workflows you can adapt for your own AI agents and production systems.
- Adversarial evaluation mini-suite: Experiment with real-world attack scenarios and learn how to test agents for reliability and robustness under adversarial conditions.
- Production evaluation checklist: Receive a practical checklist to help assess whether an AI agent is truly ready for production deployment.
- Downloadable workshop materials: Access all slides, notebooks, and supporting resources for continued learning after the workshop.
- Recording access after the workshop: Revisit key concepts and hands-on exercises with access to the session recording after the event.
- Live Q&A with the instructor: Get your technical questions answered and discuss real-world evaluation challenges directly with the instructor.
- Certificate of Completion from Packt: Earn a shareable certificate recognizing your participation and practical understanding of AI agent evaluation workflows.
Target Audience and Prerequisites:
Audience: AI/ML Engineers, Software Engineers, Data Scientists, Product Managers, and others working with or evaluating LLM-powered agents.
Prerequisites: Working knowledge of Python, familiarity with calling LLM APIs, conceptual understanding of LLMs, and a laptop capable of running Jupyter notebooks. Prior agent experience is helpful but not required.
Location
Online event
Agenda
- Block 1 | Component-Level Evaluation + Short Agent Failure Refresher
Learn what makes AI agents fundamentally different from chat applications and why traditional evaluation methods often fail for agentic systems. This session covers common agent failure modes including incorrect tool usage, flawed arguments, looping behavior, hallucinated observations, and weak synthesis. You’ll explore how to evaluate tool selection, argument quality, planning behavior, and multi-step reasoning using practical component-level metrics and labeled datasets. The lecture also introduces common evaluation pitfalls such as label leakage and over-specified trajectories. Hands-On: Run a research assistant agent on a pre-built evaluation dataset, measure tool-selection accuracy and argument-match quality, and classify real agent failures using a practical evaluation taxonomy.
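To give a flavor of the component-level metrics named in this block, here is a minimal sketch in plain Python. It is an illustration only, not the workshop's notebook code: the `ToolCall` class, the two metric functions, the tool names, and the tiny labeled dataset are all hypothetical, and exact-match scoring is just one possible way to define argument quality.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool call: the tool name plus its keyword arguments (hypothetical schema)."""
    name: str
    args: dict = field(default_factory=dict)


def tool_selection_accuracy(predicted, expected):
    """Fraction of labeled examples where the agent picked the labeled tool."""
    if not expected:
        return 0.0
    hits = sum(p.name == e.name for p, e in zip(predicted, expected))
    return hits / len(expected)


def argument_match(predicted, expected):
    """Fraction of labeled arguments reproduced exactly.

    Scores 0.0 when the wrong tool was chosen, since the arguments
    are then not comparable.
    """
    if predicted.name != expected.name:
        return 0.0
    if not expected.args:
        return 1.0
    matched = sum(predicted.args.get(k) == v for k, v in expected.args.items())
    return matched / len(expected.args)


# Made-up labels and agent predictions, one pair per evaluation example.
expected = [
    ToolCall("web_search", {"query": "llm eval", "top_k": 5}),
    ToolCall("calculator", {"expression": "2+2"}),
    ToolCall("web_search", {"query": "agents"}),
    ToolCall("fetch_page", {"url": "https://example.com"}),
]
predicted = [
    ToolCall("web_search", {"query": "llm eval", "top_k": 3}),
    ToolCall("calculator", {"expression": "2+2"}),
    ToolCall("fetch_page", {"url": "https://example.com"}),
    ToolCall("fetch_page", {"url": "https://example.com"}),
]

accuracy = tool_selection_accuracy(predicted, expected)
arg_scores = [argument_match(p, e) for p, e in zip(predicted, expected)]
print(f"tool-selection accuracy: {accuracy:.2f}")
print(f"argument match per example: {arg_scores}")
```

In this toy data the agent picks the right tool in three of four examples, and the first example shows why argument quality needs its own score: the tool is correct but one of its two arguments is not.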
- Block 2 | Trajectory Evaluation
Learn why evaluating agent trajectories matters just as much as evaluating final outputs. This session explores how reliable agents differ from inefficient or unstable ones, even when they produce the same answer. You’ll learn key trajectory evaluation concepts including golden vs. acceptable paths, loop detection, redundant tool calls, recovery behavior, latency, token cost, and trace observability. We’ll also cover how to identify failure patterns such as repeated calls, missing recovery, and unnecessary detours. Hands-On: Analyze 20 pre-recorded agent traces and write trajectory-level assertions to detect loops, excessive steps, duplicate tool calls, and failed recovery behavior.
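As a rough idea of what trajectory-level assertions can look like, the sketch below checks a trace for loops, duplicate tool calls, excessive steps, and failed recovery. Everything here is an assumption made for illustration: the trace format (a list of dicts with `tool`, `args`, and `status` keys), the helper names, and the thresholds are hypothetical and may differ from the traces used in the session.

```python
import json

MAX_STEPS = 5  # hypothetical step budget for this example


def call_key(step):
    """Canonical identity of a tool call: name plus sorted JSON arguments."""
    return (step["tool"], json.dumps(step["args"], sort_keys=True))


def duplicate_calls(trace):
    """Return each tool call that appears more than once in the trace."""
    seen, dupes = set(), []
    for step in trace:
        key = call_key(step)
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes


def has_loop(trace, max_repeats=3):
    """True if the identical call occurs max_repeats or more times in a row."""
    run = 1
    for prev, cur in zip(trace, trace[1:]):
        run = run + 1 if call_key(prev) == call_key(cur) else 1
        if run >= max_repeats:
            return True
    return False


def recovered_from_errors(trace):
    """True if every failed step is followed by at least one successful step."""
    for i, step in enumerate(trace):
        if step["status"] == "error":
            if not any(s["status"] == "ok" for s in trace[i + 1:]):
                return False
    return True


# Two made-up traces: one stuck retrying, one that recovers by rephrasing.
looping_trace = [
    {"tool": "search", "args": {"q": "x"}, "status": "error"},
    {"tool": "search", "args": {"q": "x"}, "status": "error"},
    {"tool": "search", "args": {"q": "x"}, "status": "error"},
]
healthy_trace = [
    {"tool": "search", "args": {"q": "x"}, "status": "error"},
    {"tool": "search", "args": {"q": "x y"}, "status": "ok"},
    {"tool": "fetch", "args": {"url": "https://example.com"}, "status": "ok"},
]

# Trajectory-level assertions a regression suite might run on each trace.
assert len(healthy_trace) <= MAX_STEPS
assert not has_loop(healthy_trace)
assert duplicate_calls(healthy_trace) == []
assert recovered_from_errors(healthy_trace)
print("looping trace flagged:", has_loop(looping_trace))
```

Both traces could end with the same final answer in principle, which is why these checks look at the path rather than the output.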
- Block 3 | Outcome Evaluation, LLM-as-Judge, and Regression
Learn how modern teams evaluate agent outputs beyond simple correctness metrics. This session covers factuality, completeness, safety, groundedness, and designing multi-dimensional evaluation rubrics for AI agents. Explore reference-based vs. reference-free judging, pairwise vs. rubric-style evaluation, and how to calibrate LLM-as-judge systems against human labels using correlation and agreement metrics. You’ll also examine common judge failure modes such as verbosity bias and self-preference bias, along with practical mitigation strategies. Finally, understand how evaluation evolves into production regression testing using stable benchmarks and repeated deployment checks. Hands-On: Run a pre-built LLM-as-judge over agent outputs, compare results against human labels, compute agreement metrics, and identify weak evaluation dimensions.
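For readers who want a concrete picture of judge calibration, the following sketch compares judge verdicts with human labels per rubric dimension, using raw agreement and Cohen's kappa. It assumes binary pass/fail labels (1 or 0) and uses made-up data; the function names and the `labels` structure are hypothetical, and the session may use other agreement or correlation metrics.

```python
from collections import Counter


def percent_agreement(human, judge):
    """Share of items where judge and human gave the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = percent_agreement(human, judge)
    counts_h, counts_j = Counter(human), Counter(judge)
    expected = sum(counts_h[label] * counts_j[label] for label in counts_h) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels per rubric dimension: (human labels, judge labels).
labels = {
    "factuality": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 1, 1, 0, 1, 1]),
    "completeness": ([1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0]),
}

kappas = {dim: cohens_kappa(h, j) for dim, (h, j) in labels.items()}
weakest = min(kappas, key=kappas.get)

for dim, kappa in kappas.items():
    agreement = percent_agreement(*labels[dim])
    print(f"{dim}: agreement={agreement:.2f}, kappa={kappa:.2f}")
print("weakest dimension:", weakest)
```

In this toy data the judge passes two factuality items that humans failed, so factuality shows 75% raw agreement but a kappa of only 0.5. Reporting kappa alongside raw agreement helps surface a lenient judge that raw agreement alone can hide.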