Eval-Driven Development for Engineers
Helping Software Engineers Turn AI Prototypes into Production-Ready Systems | Live and Hands-On Workshop
Why This Matters Now
AI is no longer confined to prototypes. It is being integrated into products, workflows, and decision-making systems across industries. As usage grows, so does the impact of failure. Inconsistent outputs, silent regressions, and hallucinations are not edge cases anymore. They are operational risks.
Without a systematic way to evaluate and monitor these systems, teams are left guessing. Eval-Driven Development brings clarity and discipline, making reliability measurable and repeatable rather than subjective.
What You Will Learn
- Move from subjective “vibe-based” testing to a structured Eval-Driven Development (EDD) approach for AI systems
- Define correctness using semantic evaluation instead of simple string matching
- Build and maintain a “Golden Dataset” with real-world scenarios, edge cases, and production context
- Design and implement an LLM-as-Judge for scalable, automated evaluation
- Create scoring rubrics and calibrate them against human judgment
- Integrate evaluation into CI/CD pipelines with clear release gates and regression testing
- Manage cost, latency, and reliability trade-offs in production systems
- Perform debugging and root cause analysis for failures in probabilistic systems
- Build a production readiness gate that prevents unreliable AI outputs from going live
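The difference between string matching and semantic evaluation mentioned above can be illustrated with a minimal sketch. Token overlap stands in here for a real semantic measure (embeddings or an LLM judge), and the threshold is an arbitrary assumption:

```python
def exact_match(output: str, reference: str) -> bool:
    # Brittle: any rephrasing fails, even when the meaning is identical.
    return output.strip().lower() == reference.strip().lower()

def semantic_match(output: str, reference: str, threshold: float = 0.5) -> bool:
    # Stand-in for semantic evaluation: token overlap (Jaccard similarity).
    # A production system would compare meaning via embeddings or a judge model.
    a, b = set(output.lower().split()), set(reference.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

reference = "the order ships in 5 business days"
output = "the order ships within 5 business days"
print(exact_match(output, reference))     # False: wording differs
print(semantic_match(output, reference))  # True: meaning overlaps enough
```

The point is the interface, not the similarity function: correctness is scored against intent, so a rephrased but correct answer passes.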
What You’ll Walk Away With
- A practical framework for building and maintaining reliable AI systems
- A working evaluation pipeline you can adapt to your own use cases
- A structured dataset that defines correctness for your application
- A repeatable process for testing, debugging, and improving system performance
- A production readiness approach that reduces risk before deployment
What Makes This Different
This is not a workshop about prompt tips or model comparisons. It focuses on how to build systems that behave reliably over time.
- It treats evaluation as a core part of development, not an afterthought
- It emphasizes real-world workflows over isolated examples
- It focuses on debugging and failure analysis, not just output quality
- It connects AI development with established software engineering practices
- It is designed for teams that need to ship and maintain systems, not just experiment
Who Should Attend
- Software engineers and systems engineers building AI-powered or LLM-based applications
- AI/ML engineers and practitioners working on reliability, evaluation, or deployment of models
- Technical leads and architects responsible for production-grade AI systems
- Product engineers integrating LLMs into user-facing features
- Teams transitioning from prototyping to production with AI systems
- Anyone frustrated with unreliable outputs and looking for a systematic, test-driven approach to AI development
Limited Seats. High Impact.
This is a live, interactive workshop with limited seats to maintain quality and hands-on depth.
Lineup
Imran Ahmad
Good to know
Highlights
- 5 hours
- Online
Location
Online event
Agenda
Session 1: Building the “Golden Dataset”
Before improving outputs, you need a clear definition of what “good” looks like. In this module, participants will focus on creating a high-quality “Golden Dataset” that reflects real-world usage. This includes designing examples that go beyond simple inputs to capture edge cases, ambiguity, and production context. You will also learn how to evaluate outputs based on meaning and intent rather than exact matches, and how to maintain and version your dataset as requirements evolve.
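A Golden Dataset entry of the kind this session builds can be sketched as a small record type. The field names and examples below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    example_id: str
    user_input: str          # the request as users actually phrase it
    reference_output: str    # what a correct answer should convey (meaning, not exact words)
    tags: list = field(default_factory=list)  # e.g. "edge-case", "ambiguous"
    version: int = 1         # bump as requirements evolve; keep old versions for diffing

golden_set = [
    GoldenExample("refund-001", "how do i get my money back??",
                  "Explain the refund process and the 14-day window.",
                  tags=["informal", "happy-path"]),
    GoldenExample("refund-002", "refund for an item I never ordered",
                  "Recognize a possible billing error and escalate; do not promise a refund.",
                  tags=["edge-case", "ambiguous"]),
]

edge_cases = [ex for ex in golden_set if "edge-case" in ex.tags]
print(len(golden_set), len(edge_cases))  # 2 1
```

Tagging and versioning are what make the dataset maintainable: you can slice evaluation runs by scenario type and trace score changes back to requirement changes.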
Session 2: Architecting the “LLM-as-Judge”
Manual evaluation quickly becomes a bottleneck. This module introduces a scalable alternative by building an automated evaluation layer using an LLM as a judge. You will design structured scoring rubrics that assess dimensions like accuracy, tone, and tool usage. The focus is not just on automation, but on reliability. You will learn how to calibrate your evaluation system against human judgment and identify where human review is still necessary.
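One way to structure the scoring pass this session describes is a rubric of dimensions scored independently. In this sketch, `call_judge_model` is a placeholder for a real judge-LLM API call, and the rubric wording is a made-up example:

```python
RUBRIC = {
    "accuracy":   "Does the answer match the reference factually? (1-5)",
    "tone":       "Is the tone appropriate for the context? (1-5)",
    "tool_usage": "Were the right tools invoked with valid arguments? (1-5)",
}

def call_judge_model(prompt: str) -> int:
    # Placeholder: a real implementation would send `prompt` to a judge LLM
    # and parse a structured (e.g. JSON) integer score from its response.
    return 4

def judge(output: str, reference: str) -> dict:
    # Score each rubric dimension separately so failures are diagnosable,
    # then aggregate into an overall score.
    scores = {}
    for dimension, question in RUBRIC.items():
        prompt = (f"Reference: {reference}\nCandidate: {output}\n"
                  f"{question}\nAnswer with a single integer from 1 to 5.")
        scores[dimension] = call_judge_model(prompt)
    scores["overall"] = sum(scores[d] for d in RUBRIC) / len(RUBRIC)
    return scores

result = judge("Refunds take 5 business days.",
               "State the 5-business-day refund window.")
print(result["overall"])  # 4.0 with the placeholder judge
```

Calibration then means running the same examples past human raters and checking that the judge's per-dimension scores agree before trusting it unattended.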
Session 3: The “Green-Red” Loop and CI Integration
Reliable systems require continuous testing, not one-time validation. In this module, you will integrate evaluation into a development workflow that mirrors modern software practices. This includes defining performance thresholds for release decisions, setting up regression tests to catch unintended changes, and incorporating evaluation checks into CI pipelines. You will also explore how to balance evaluation depth with cost and latency constraints in real-world environments.
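The release-gate logic this session covers reduces to a small decision function: fail the build when quality drops below an absolute bar or regresses against the last released baseline. The thresholds below are illustrative assumptions, not recommended values:

```python
PASS_THRESHOLD = 0.85      # minimum average eval score to release at all
REGRESSION_MARGIN = 0.02   # allowed drop relative to the last released baseline

def release_gate(scores: list[float], baseline_avg: float) -> bool:
    avg = sum(scores) / len(scores)
    if avg < PASS_THRESHOLD:
        return False                       # red: absolute quality bar not met
    if avg < baseline_avg - REGRESSION_MARGIN:
        return False                       # red: regression vs. baseline
    return True                            # green: safe to ship

# In CI, a False result would exit nonzero and block the deploy step.
print(release_gate([0.90, 0.88, 0.86], baseline_avg=0.87))  # True
print(release_gate([0.90, 0.70, 0.60], baseline_avg=0.87))  # False
```

The baseline comparison is what catches silent regressions: a change can clear the absolute threshold and still ship worse behavior than the previous release.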