As agents proliferate throughout apps, we need an easy way for engineers and teams to capture expected AI behavior as tests or evals, so that everyone on a team feels comfortable leveraging agentic behavior.
## Why the Dataset Approach Fails
The first approach to consider is building a dataset that captures expected agent behavior: have the engineer interact with the AI, record the tool call response they expect, then add it as a pass/fail test. Sounds reasonable, except there are infinite conversational paths leading to any given outcome. To deterministically verify that the expected behavior always holds, we'd theoretically need infinite tests. That's not just impractical; it's impossible. And even if you declare "good enough!" on a manageable test suite, you've only scratched the surface of deeper structural problems with the dataset approach.
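To make that concrete, here is a minimal sketch of what one of these pass/fail dataset entries might look like. Everything here is illustrative: `DatasetCase`, the `run_agent` harness, and the sample conversations are assumptions rather than an existing test suite; only `create_integration` comes from the scenario discussed later in this post.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DatasetCase:
    user_messages: list[str]   # scripted conversation that should trigger the tool
    expected_tool: str         # name of the tool call we recorded as "correct"

def run_case(case: DatasetCase, run_agent: Callable[[list[str]], list[str]]) -> bool:
    """Pass/fail: did the agent's final tool call match the recorded one?

    `run_agent` is a hypothetical harness that replays the conversation against
    the agent and returns the names of the tools it called, in order.
    """
    tool_calls = run_agent(case.user_messages)
    return bool(tool_calls) and tool_calls[-1] == case.expected_tool

# Illustrative entries; a real suite would accumulate hundreds of these.
CASES = [
    DatasetCase(["I want to build a new Slack integration"], "create_integration"),
    DatasetCase(["Hook my CRM up to our billing system"], "create_integration"),
]
```

Each entry pins one conversational path to one expected tool call, which is exactly why the suite can never cover the space of paths users actually take.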
### The Dataset Dilemma
Even when you accept a finite dataset, the conventional approach of capturing successful interactions, labeling them, and building comprehensive test suites creates its own set of problems:
**Brittleness**: What happens when we add a new tool or change a tool description? Our carefully curated dataset suddenly no longer reflects real-world scenarios. You end up in a dangerous situation where your tests pass but the agent no longer completes users' objectives in the real world—the tests reflect an outdated tool environment while users interact with the evolved system.
**Bad data accumulation**: We inevitably add edge cases and outliers that don't represent typical user interactions, skewing our evaluation toward rare scenarios.
**Maintenance overhead**: Every system change requires revisiting and potentially rebuilding large portions of our test suite.
## A Different Approach: Objective-Based Evaluation
Instead of trying to capture every possible conversation, what if we focused on declaring agent objectives? At each release, we could track whether conversations successfully reach those objectives. A dashboard would show metrics against each objective, and the test becomes simple: did the conversation reach the objective? No complex edge case handling, just a count of objectives reached.
This shifts our focus from prescriptive test cases to outcome measurement.
### The Implementation Win
There's a simple way to implement these evals: with an LLM-as-a-judge, they can run on evaluation platforms that score live conversations and produce graphs showing the raw count or percentage of objectives reached.
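As a rough sketch of what such a judge could look like, the snippet below grades a transcript against an objective and aggregates the pass rate. `call_llm` is a stand-in for whatever model client or platform hook you have available; the prompt wording and JSON parsing are illustrative, not a specific platform's API.

```python
import json
from typing import Callable

# Judge prompt: asks the model for a structured verdict on one conversation.
JUDGE_PROMPT = """You are grading an AI agent's conversation against an objective.

Objective: {objective}

Conversation transcript:
{transcript}

Did the conversation reach the objective? Answer with JSON: {{"reached": true or false, "reason": "..."}}"""

def objective_reached(objective: str, transcript: str,
                      call_llm: Callable[[str], str]) -> bool:
    """Ask the judge model whether this conversation reached the objective."""
    raw = call_llm(JUDGE_PROMPT.format(objective=objective, transcript=transcript))
    try:
        return bool(json.loads(raw)["reached"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # treat unparseable judge output as "not reached"

def objectives_reached_rate(objective: str, transcripts: list[str],
                            call_llm: Callable[[str], str]) -> float:
    """The metric the dashboard plots: fraction of conversations that reached the objective."""
    hits = sum(objective_reached(objective, t, call_llm) for t in transcripts)
    return hits / len(transcripts) if transcripts else 0.0
```

The eval itself stays tiny; the platform's job is mostly plumbing (collecting live transcripts, running the judge, plotting the rate over time).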
## The Nuanced Problem: Writing Good Objectives
The challenge lies in writing good objectives. Consider this scenario: as an engineer, I want to ensure the system creates a new integration via the `create_integration` tool rather than the `edit_integration` tool. I could write the objective: "calls create_integration when the user wants to build something new."
But this creates its own problems:
**Over-specification removes flexibility**: Sometimes there are multiple valid paths to the right outcome. By being too prescriptive about exact tool calls, we constrain the agent unnecessarily.
**Under-specification misses the point**: If I write a vague objective like "helps the user build an integration," I might get passing scores when the agent merely provides helpful text responses instead of actually building something functional. The user gets information but no working outcome.
## The Fundamental Tension
We're caught between two imperfect options: over-specifying objectives (which kills agent flexibility) or under-specifying them (which passes tests that don't actually help users). But we need to make forward progress.
Here's what we've learned so far:
**Objectives are the right mental model**: Rather than thinking in terms of conversation flows, we should think in terms of what we want the agent to accomplish.
**Capturing datasets for future engineers isn't the answer**: Static datasets become stale too quickly in our rapidly evolving tool ecosystem.
**The sweet spot is functional objectives**: We need objectives that are specific enough to measure meaningful outcomes but flexible enough to allow multiple solution paths.
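To make that sweet spot concrete, here is the same intent written at three levels of specificity. The wording is illustrative; only the `create_integration` / `edit_integration` scenario comes from the example above.

```python
# Three ways to phrase the same intent, at different levels of specificity.

OVER_SPECIFIED = (
    "Calls create_integration (and never edit_integration) as its first tool call "
    "when the user asks to build something new."
)  # pins the exact tool sequence, so valid alternative paths fail

UNDER_SPECIFIED = (
    "Helps the user build an integration."
)  # a purely textual "here's how you could do it" reply would pass

FUNCTIONAL = (
    "By the end of the conversation, a new integration exists that matches the "
    "user's request; it may be created through any sequence of tool calls."
)  # measures the outcome, leaves the path open
```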
### The Real Value of Objective-Based Evals
This approach gives us these benefits:
- **Clear objective capture**: Explicitly define what an agent in a certain domain should accomplish
- **Deployment impact visibility**: See how engineering deployments affect objectives reached; if an objective suddenly drops after a deploy, we know where to start investigating
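As a sketch of what that visibility could look like, the snippet below groups judged conversations by the release they ran against and compares objective-reached rates. The result shape (`release` and `reached` fields) is an assumption about how results get logged, not a feature of any particular platform.

```python
from collections import defaultdict

def reached_rate_by_release(results: list[dict]) -> dict[str, float]:
    """results: [{"release": "v42", "objective": "...", "reached": True}, ...]"""
    totals = defaultdict(lambda: [0, 0])  # release -> [hits, total]
    for r in results:
        totals[r["release"]][0] += int(bool(r["reached"]))
        totals[r["release"]][1] += 1
    return {release: hits / total for release, (hits, total) in totals.items()}

# If the rate for an objective drops sharply between two releases,
# that deploy is the first place to start investigating.
```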
## Moving Forward
If we accept that creating a dataset doesn't solve our problems because it becomes stale the moment we change any tool or add new capabilities, then we need evaluation methods that adapt with our system. Static test suites are fundamentally mismatched to the dynamic nature of AI agent development.
I want to start building around this objective-based framework, even if it's imperfect. The alternative—trying to enumerate every possible conversation path—is guaranteed to fail at scale.
The key is finding objectives that capture the essence of what we want the agent to accomplish without being so prescriptive that we eliminate valid alternative approaches. It's not a perfect solution, but it's a tractable one that can evolve with our system.
The next step is to explore ease of implementation, ease of understanding and adoption by teams, and whether this version of evals actually provides value rather than adding noise. If it works, this could be a great approach to replicate across agents.
**Next question: What objectives should we start with, and how do we define good objectives?**