If you build voice AI agents, you run tests. You write success criteria. You let an AI grader score the transcripts. Most of the time this works well.
But sometimes the grader fails the agent for the wrong reason.
I hit this last week. My voice agent ran an integration test in which a caller refused to give her email address, then, mid-call, reported a burning smell from her electrical panel.
The agent did exactly what the prompt required. It accepted the email refusal without looping. It switched to the Emergency Flow when the burning smell came up.
It did not ask diagnostic questions. It batch-confirmed the remaining details, issued the emergency close, and fired end_call.
The grader still marked it as failed.
In Retell AI, the grader is the evaluator that scores the test cases you create under 'Simulation Testing'.
The reason the grader gave:
"When the burning smell is reported, the agent does not immediately switch to Emergency Flow. Instead, the agent continues to confirm details. This is a diagnostic/clarifying question, not an immediate escalation."
I checked the prompt. The Emergency Flow has eight explicit steps. Step seven is a batch confirm. The agent did the batch confirm because the flow requires it.
The grader read "switches to Emergency Flow" as "skips everything and closes immediately." That is not what the prompt says. That is what the grader assumed it ought to say.
The grader invented a rule that was not in the prompt.
This is what I call a grader false positive.
The agent works. The test fails anyway. If you trust the grader without checking, you will fix things that are not broken and break things that work.
How to spot a grader false positive.
Read the failure reason out loud. Then check the rule it cites against the actual prompt.
In my case, three signals tell me the grader is wrong, not the agent.
The first signal is rule invention.
The grader cites a rule that sounds plausible but does not exist in the prompt. In my case, the prompt defines the Emergency Flow as an eight-step procedure. The grader treated it as "drop everything and close." The grader's version is intuitive, but it is not what the prompt says.
The second signal is overcorrection from a real concern.
Emergency calls are time-sensitive. A grader trained on "be fast in emergencies" can over-apply that into "skip every step that is not the close." The instinct is reasonable. The verdict is wrong because the prompt has already accounted for that instinct by defining a tight, minimum-field emergency procedure.
The third signal is a contradiction with the prompt itself.
If following the grader's rule would force the agent to break a different rule, the grader is wrong. In my case, skipping the batch confirm would have meant breaking the prompt's Emergency Flow definition outright. The grader's expectation and the prompt's instructions cannot both be right.
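The third signal can be made concrete with a small comparison. The Python sketch below is an illustration only and is not part of Retell AI. The step labels are hypothetical names for the tail of the Emergency Flow (the real flow has eight steps, and only the last few are modelled here), and `grader_expected` is my reading of what the grader's stated reason implies.

```python
from typing import Sequence


def is_ordered_subsequence(required: Sequence[str], observed: Sequence[str]) -> bool:
    """True if every required step appears in `observed`, in the same order."""
    remaining = iter(observed)
    # `step in remaining` consumes the iterator, so order is enforced.
    return all(step in remaining for step in required)


def steps_grader_would_skip(required: Sequence[str], grader_expected: Sequence[str]) -> list[str]:
    """Steps the prompt requires but the grader's expected behaviour omits."""
    return [step for step in required if step not in grader_expected]


# Hypothetical labels for the last steps of the Emergency Flow in the prompt.
PROMPT_FLOW_TAIL = ["batch_confirm", "emergency_close", "end_call"]

# What the agent did in the transcript after the burning smell was reported.
agent_actions = ["switch_to_emergency_flow", "batch_confirm", "emergency_close", "end_call"]

# Assumption: what the grader's reason implies the agent should have done.
grader_expected = ["switch_to_emergency_flow", "emergency_close", "end_call"]

agent_followed_prompt = is_ordered_subsequence(PROMPT_FLOW_TAIL, agent_actions)
skipped = steps_grader_would_skip(PROMPT_FLOW_TAIL, grader_expected)

print("Agent followed the prompt:", agent_followed_prompt)
print("Steps the grader would have the agent skip:", skipped)
```

If the agent followed the prompt and the grader's expectation would skip a required step, the two cannot both be right, and the prompt wins.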
How to fix it?
When a test fails, check three things in order.
- Does the rule the grader cites actually exist in the prompt? If it does not, the grader invented it, and the failure is a false positive.
- If the rule exists, check whether the agent's behaviour really violates it or whether the grader misread the transcript. Misreading is possible too.
- If the rule exists and the behaviour really violates it, the failure is real. The fix is on the prompt side (make the rule more enforceable) or the test side (if the test was unrealistic).
When the failure is a false positive, the fix belongs on the grader side: tighten the success criteria, correct the example interactions in the test, or check whether the grader has misread the same kind of test before.
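The three checks above can be sketched as a small decision function. This is a minimal Python sketch, not a tool that exists in Retell AI: the names `Verdict` and `triage` are hypothetical, and both inputs are judgments a human makes after reading the prompt and the transcript.

```python
from enum import Enum


class Verdict(Enum):
    GRADER_INVENTED_RULE = "false positive: the cited rule is not in the prompt"
    GRADER_MISREAD_TRANSCRIPT = "false positive: the grader misread the transcript"
    REAL_FAILURE = "real failure: fix the prompt or the test"


def triage(rule_in_prompt: bool, behaviour_violates_rule: bool) -> Verdict:
    """Apply the checks in order and stop at the first one that settles it."""
    # Check 1: does the rule the grader cites exist in the prompt at all?
    if not rule_in_prompt:
        return Verdict.GRADER_INVENTED_RULE
    # Check 2: the rule exists, but did the agent actually violate it?
    if not behaviour_violates_rule:
        return Verdict.GRADER_MISREAD_TRANSCRIPT
    # Check 3: the rule exists and was violated, so the failure is real.
    return Verdict.REAL_FAILURE


# The burning-smell test: the "skip everything and close" rule is not in the prompt.
burning_smell_verdict = triage(rule_in_prompt=False, behaviour_violates_rule=False)
print(burning_smell_verdict.value)
```

The order matters. The second question is only worth asking once the first has confirmed that the rule is real.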
A grader false positive looks identical to an agent failure on the surface. Reading the transcript is the only reliable way to tell them apart.
LLM graders are useful but not infallible.
They pattern-match the same way the agents do.
They have intuitions about what a "good emergency response", a "clean close", or a "proper refusal" should look like.
Sometimes those intuitions are stricter than the prompt's actual rules. When that happens, the grader fails the agent for not matching its intuition, even when the agent matches the prompt exactly.
If you treat every grader failure as an agent failure, you will overfit your prompt to the grader's intuitions instead of to real user needs.
You will add rules to handle false positives. You will weaken behaviours that the prompt got right. The agent gets worse in production while the test scores get better.
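One way to avoid this is to record the outcome every time you read a failed transcript, so you can see which tests the grader tends to misjudge before you touch the prompt. The Python sketch below assumes a hand-maintained log. The `ReviewedFailure` type, the test names, and the entries are all made up for illustration and do not come from any real test run.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReviewedFailure:
    test_name: str          # hypothetical test identifier
    grader_was_wrong: bool  # set by a human after reading the transcript


def false_positive_counts(reviews: Iterable[ReviewedFailure]) -> Counter:
    """Count, per test, how many grader failures were judged false positives."""
    return Counter(r.test_name for r in reviews if r.grader_was_wrong)


# Illustrative entries only, not real results.
review_log = [
    ReviewedFailure("emergency_burning_smell", grader_was_wrong=True),
    ReviewedFailure("emergency_burning_smell", grader_was_wrong=True),
    ReviewedFailure("booking_reschedule", grader_was_wrong=False),
]

counts = false_positive_counts(review_log)
for test_name, count in counts.most_common():
    print(f"{test_name}: {count} grader false positive(s)")
```

A test that keeps appearing in this count is a candidate for rewriting its success criteria, not for changing the agent's prompt.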