Is this workshop beginner-friendly?

Participants should have basic Python knowledge and familiarity with LLM concepts. Prior experience with AI agents is helpful, but not required.

Do I need prior experience with AI agents?

Basic familiarity with LLMs and AI applications is helpful, but you don't need to be an expert in agent frameworks to benefit from this bootcamp.

Will I get access to the workshop materials afterward?

Yes. Participants will receive downloadable notebooks, slides, supporting materials, and recording access after the workshop.

What will I build or take away from this workshop?

You’ll leave with reusable evaluation workflows, adversarial testing examples, practical production-readiness checklists, and a stronger framework for evaluating real-world AI agents.

Will this workshop include hands-on exercises?

Yes, every session block includes guided hands-on notebook exercises designed to reinforce the evaluation concepts covered in the lectures.

Will I receive a certificate?

Yes, all participants will receive a Certificate of Completion from Packt after attending the workshop.

Why should I learn evaluation instead of spending more time building agents?

As AI agents move into production, evaluation has become one of the most valuable and sought-after skills in AI engineering.

Agent Evals Bootcamp

ByPackt Publishing Limited

Online event

Saturday, June 27 • 9 AM - 1 PM EDT

Overview

Build AI Agents You Can Trust in Production.

Learn how to evaluate, test, and improve AI agents before deployment.

A hands-on bootcamp focused on the most critical skill in agent engineering: evaluation.

Building AI agents is easy.

Knowing whether they actually work is hard.

Most agent failures aren't caused by the model but they're caused by poor evaluation.

Teams often discover problems only after deployment:

Agents that work in demos but fail in production
Tool-calling workflows that silently break
Prompt or model updates that introduce regressions
Hallucinated actions and unreliable decisions
Metrics that don't reflect real user outcomes

In this practical bootcamp, you'll learn the frameworks, metrics, and testing workflows used by leading AI teams to measure, diagnose, and improve AI agents with confidence.

You'll Learn How To:

Measure agent performance beyond benchmark scores
Evaluate reasoning, planning, and tool use
Identify and diagnose agent failures
Build realistic evaluation datasets
Detect regressions before deployment
Automate evaluations and use AI judges effectively
Design a repeatable evaluation system for your agents

By the end of the workshop, you'll have a production-ready framework for evaluating and improving AI agents at scale.

What's Included

🎥 Live sessions + Access to all recordings and materials
✅ A practical evaluation framework you can apply immediately to your own systems
🤖 6 months of access to our AI Evals assistant.
🛠️ Hands-on exercises and implementation templates
📚 Earn a shareable certificate recognizing your participation and practical understanding of AI agent evaluation workflows.
🤝 Access to a community of AI practitioners

📅 June 27, 2026 ⏱ 5 hours

👨‍💻 Led by Ammar Mohanna, PhD

🏆 Packt Publishing Endorsed Certification

Instructor:

Ammar Mohanna is an AI engineer, researcher, and educator focused on building and evaluating large language model (LLM) systems for real-world production use. He holds a PhD in Edge Artificial Intelligence and has extensive experience designing, deploying, and assessing AI systems across enterprise, education, and applied machine learning domains. His work centers on evaluation frameworks, system reliability, and operational readiness for generative and multimodal AI systems. Ammar is also an adjunct professor and course designer, teaching machine learning, AI engineering, and generative AI to academic and professional audiences. Through his industry consulting and teaching, he helps teams move LLM-powered systems from prototypes to reliable, production-grade solutions.

Course Objectives:

Explain where modern agents fail across tool calls, planning steps, trajectories, outputs, and adversarial inputs.
Build component-level evaluations for tool selection, argument quality, and planning quality.
Define and measure trajectory-level metrics including step count, cost, recovery, and loop detection.
Build outcome-level evaluators using multi-dimensional rubrics and LLM-as-judge, then calibrate the judge against human labels.
Identify adversarial failure modes unique to agents, especially indirect prompt injection through tool outputs.
Decide which evaluation layer matters most for a specific use case and team maturity.

Why Agent Evaluation Matters

Building AI agents is easier than ever.

Evaluating whether they actually work is the real challenge.

Traditional software testing wasn't designed for systems that reason, plan, use tools, and make autonomous decisions.

Without robust evaluation systems, teams struggle to answer critical questions:

Is the agent actually improving?
Which changes introduce regressions?
How reliable is the system in production?
What should we optimize next?

The most successful AI teams don't just build agents—they build evaluation systems around them.

This bootcamp shows you exactly how.

The Production Agent Evaluation Framework

Throughout the bootcamp, you'll learn a practical framework for evaluating AI agents across four critical layers:

Layer 1: Component Evaluation: Evaluate prompts, retrieval systems, tools, and model outputs independently.
Layer 2: Trajectory Evaluation: Analyze how agents reason, plan, and make decisions throughout task execution.
Layer 3: Outcome Evaluation: Measure whether the agent successfully achieved the user's objective.
Layer 4: Adversarial Evaluation: Stress-test agents against edge cases, unexpected behaviors, attacks, and failures.

Most teams focus only on prompts and outputs.

Production teams evaluate all four layers.

What You'll Build

This isn't a theory-heavy workshop.

You'll actively work through real-world evaluation scenarios and build systems you can immediately apply in your own projects.

During the bootcamp you'll:

Design evaluation strategies for AI agents
Create high-quality evaluation datasets
Evaluate agent trajectories and reasoning paths
Analyze real agent failures
Develop evaluation metrics that align with business outcomes
Create adversarial test suites
Build automated evaluation pipelines
Implement AI judges and LLM-as-a-Judge workflows
Design regression testing systems for continuous improvement

Capstone Project

Throughout the bootcamp you'll work on a practical evaluation project that brings together everything you learn.

You'll:

Define success criteria for an AI agent
Build an evaluation dataset
Create meaningful evaluation metrics
Analyze agent trajectories
Identify failure patterns
Design adversarial tests
Build a repeatable evaluation workflow

You'll leave with a complete evaluation framework that can be adapted to your own agents and production systems.

Target Audience and Prerequisites:

Audience: AI/ML Engineers, Software Engineers, Data Scientists, Product Managers, and others working with or evaluating LLM-powered agents.

Prerequisites: Working knowledge of Python, familiarity with calling LLM APIs, conceptual understanding of LLMs, and a laptop capable of running Jupyter notebooks. Prior agent experience is helpful but not required.

Why Attend This Workshop

The biggest challenge in AI today isn't building agents.

It's knowing whether they're actually working.

The organizations that succeed with AI won't be the ones that build the most agents.

They'll be the ones that can reliably evaluate, improve, and trust the agents they deploy.

If you're building AI agents today or planning to, you need evaluation skills in your toolkit.

Join us and learn the practical evaluation frameworks used to build AI agents that can be trusted in production.

Lineup

Headliner

Ammar Mohanna, PhD

Good to know

Highlights

4 hours
Online

Refund Policy

Refunds up to 10 days before event

Location

Online event

Agenda

09:00 AM - 09:50 AM

Block 1 | Component-Level Evaluation + Short Agent Failure Refresher

Learn what makes AI agents fundamentally different from chat applications and why traditional evaluation methods often fail for agentic systems. This session covers common agent failure modes including incorrect tool usage, flawed arguments, looping behavior, hallucinated observations, and weak synthesis. You’ll explore how to evaluate tool selection, argument quality, planning behavior, and multi-step reasoning using practical component-level metrics and labeled datasets. The lecture also introduces common evaluation pitfalls such as label leakage and over-specified trajectories. Hands-On: Run a research assistant agent on a pre-built evaluation dataset, measure tool-selection accuracy and argument-match quality, and classify real agent failures using a practical evaluation taxonomy.

10:00 AM - 10:50 AM

Block 2 | Trajectory Evaluation

Learn why evaluating agent trajectories matters just as much as evaluating final outputs. This session explores how reliable agents differ from inefficient or unstable ones, even when they produce the same answer. You’ll learn key trajectory evaluation concepts including golden vs. acceptable paths, loop detection, redundant tool calls, recovery behavior, latency, token cost, and trace observability. We’ll also cover how to identify failure patterns such as repeated calls, missing recovery, and unnecessary detours. In the hands-on exercise, participants will analyze 20 pre-recorded agent traces and write trajectory-level assertions to detect loops, excessive steps, duplicate tool calls, and failed recovery behavior.

11:00 AM - 11:50 AM

Block 3 | Outcome Evaluation, LLM-as-Judge, and Regression

Learn how modern teams evaluate agent outputs beyond simple correctness metrics. This session covers factuality, completeness, safety, groundedness, and designing multi-dimensional evaluation rubrics for AI agents. Explore reference-based vs. reference-free judging, pairwise vs. rubric-style evaluation, and how to calibrate LLM-as-judge systems against human labels using correlation and agreement metrics. You’ll also examine common judge failure modes such as verbosity bias and self-preference bias, along with practical mitigation strategies. Finally, understand how evaluation evolves into production regression testing using stable benchmarks and repeated deployment checks. Hands-On: Run a pre-built LLM-as-judge over agent outputs, compare results against human labels, compute agreement metrics, and identify weak evaluation dimensions.

Frequently asked questions

Organized by

Packt Publishing Limited

Report this event

Agent Evals Bootcamp

Learn how to evaluate, test, and improve AI agents before deployment.

You'll Learn How To:

What's Included

📅 June 27, 2026 ⏱ 5 hours

👨‍💻 Led by Ammar Mohanna, PhD

🏆 Packt Publishing Endorsed Certification

Instructor:

Course Objectives:

Why Agent Evaluation Matters

The Production Agent Evaluation Framework

What You'll Build

Capstone Project

Target Audience and Prerequisites:

Why Attend This Workshop

Lineup

Ammar Mohanna, PhD

Good to know

Location

Online event

Agenda

Block 1 | Component-Level Evaluation + Short Agent Failure Refresher

Block 2 | Trajectory Evaluation

Block 3 | Outcome Evaluation, LLM-as-Judge, and Regression

More events from Packt Publishing Limited

Discover more events from Packt Publishing Limited, from Science & Tech to other experiences you might love.

Still looking for the right event?

Explore all online events to browse and filter by date, category, and more.

Agent Evals Bootcamp

Learn how to evaluate, test, and improve AI agents before deployment.

You'll Learn How To:

What's Included

📅 June 27, 2026 ⏱ 5 hours

👨‍💻 Led by Ammar Mohanna, PhD

🏆 Packt Publishing Endorsed Certification

Instructor:

Course Objectives:

Why Agent Evaluation Matters

The Production Agent Evaluation Framework

What You'll Build

Capstone Project

Target Audience and Prerequisites:

Why Attend This Workshop

Lineup

Ammar Mohanna, PhD

Good to know

Location

Online event

Agenda

Block 1 | Component-Level Evaluation + Short Agent Failure Refresher

Block 2 | Trajectory Evaluation

Block 3 | Outcome Evaluation, LLM-as-Judge, and Regression

More events from Packt Publishing Limited

Discover more events from Packt Publishing Limited, from Science & Tech to other experiences you might love.

You might also like...

Browse more events with different dates, prices, and formats to find your next great experience.

Still looking for the right event?

Explore all online events to browse and filter by date, category, and more.