
I built Mark, a voice receptionist for an HVAC company. He runs on Retell, the platform we use to host the voice agent. 

His job sounds simple. Pick up the phone, work out what the caller needs, take their details, and pass the lead to the team.

The trouble was pricing.


Mark handles two kinds of calls. The standard intake walks through name, address, property type, and the work the caller needs. 

The pricing inquiry is different. The caller wants a number before they commit.

The job is to quote the flat price for standard work (tune-ups, diagnostic fees, service calls, thermostat installs), explain when a quote needs a technician visit instead, and capture details for a follow-up.


I held both jobs in one prompt.

The intake section said, "Do not quote prices; a technician will provide pricing after the assessment."

The pricing section said, "Quote the flat price from the pricing table for the service the caller has asked about."

Two rules. Direct opposites.


In practice, the agent picked whichever rule won the moment.

Two callers in a row could ask "how much for a tune-up?" and get different answers.

One would get the $89 price. The next would be told that a technician would call back. I could not predict which.


I tried to write my way out. Sub-rules and exception clauses, a guard that said: "quote prices only when the service appears in the pricing table."

Each addition fixed the test I was staring at and broke a different one. 


The graders, the LLM judges I use to score test conversations, started throwing false positives.

They were flagging violations that had not happened because the rules had become hard to read.

That was when I split him.

How did I decide?

When I stepped back, the failure mode was visible in the prompt itself. I was writing two prompts inside one, and one was bleeding into the other.

I call this the bleed test framework.


Four signs it has triggered.

  1. Two rules give different answers to the same caller's question. Both rules fire on the same trigger. In Mark's prompt, "how much for a tune-up?" triggered both the "do not quote prices" rule and the "quote $89" rule, and the agent picked whichever won the context.
  2. A rule has stacked exceptions. When you find yourself writing "always confirm the property address, unless it's an emergency, unless it's a pricing inquiry, unless the caller has already given it," the rule has run out of room. Each exception is an admission that the rule belongs in a different section.
  3. A rule fires during a call it was not written for. The intake section's "confirm property address" rule fires on a pricing-only call where no address was needed. The rule does not know which job is active, because the prompt has no clean boundary between the jobs.
  4. The prompt is way past 3,500 tokens. Retell adds a surcharge above this threshold. The cost roughly doubles per call once you cross it, and at managed-service scale, that difference comes straight out of your margin. Latency also grows with size, but cost is the more immediate trigger. Cost is on next month's invoice. Latency is a judgment call about the caller experience.
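Two of the four signs can be checked mechanically before you read a single transcript. The sketch below is only an illustration: the helper names are hypothetical, the token count is a crude four-characters-per-token estimate rather than a real tokenizer, and counting "unless" is a stand-in for actually reading the rule.

```python
import re

TOKEN_THRESHOLD = 3500  # the threshold named in sign 4


def estimate_tokens(prompt: str) -> int:
    """Rough estimate only: assumes ~4 characters per token, not a real tokenizer."""
    return len(prompt) // 4


def stacked_exceptions(rule: str) -> int:
    """Count the 'unless' clauses in one rule (a crude proxy for sign 2)."""
    return len(re.findall(r"\bunless\b", rule, flags=re.IGNORECASE))


def bleed_signals(prompt: str, max_exceptions: int = 1) -> list:
    """Return a human-readable list of the mechanical warning signs found."""
    signals = []
    for line in prompt.splitlines():
        if stacked_exceptions(line) > max_exceptions:
            signals.append("stacked exceptions: " + line.strip()[:60])
    if estimate_tokens(prompt) > TOKEN_THRESHOLD:
        signals.append("prompt exceeds token threshold")
    return signals


rule = ("Always confirm the property address, unless it's an emergency, "
        "unless it's a pricing inquiry, unless the caller has already given it.")
print(bleed_signals(rule))
```

Signs 1 and 3 are about behaviour on a live call, so they still need test conversations and a grader. This kind of check only tells you where to look first.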

Here is what the first sign looked like in the original Mark prompt. Two rules, both correct in their own domain, fighting inside the same prompt.

## Section 4: Pricing policy
Do not quote prices. A technician will provide pricing
after the assessment.
## Section 7: Standard service pricing
For tune-ups, quote $89. For service calls, quote $79.
For thermostat installs, quote $200.

After the split, each agent has one pricing rule, and that rule does not contradict anything else in its own prompt.

# Receptionist prompt
If the caller asks about pricing for any standard
service, transfer to the pricing agent.
# Pricing agent prompt
For tune-ups, quote $89. For service calls, quote $79.
For thermostat installs, quote $200. For services not
in the table, explain the price depends on the issue
and offer to book a diagnostic visit.

The receptionist no longer needs the "do not quote prices" rule, because it does not quote prices at all.

The pricing agent does not need a global "do not quote" guard because it has no other job that would need one.

Each rule has a single owner.
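The ownership idea can be sketched in a few lines of code. This is not how Mark is implemented (both agents are prompts, and the handoff happens on the voice platform). The keyword list and function names are hypothetical, and only the three prices come from the prompt above.

```python
# Flat prices taken from the pricing agent prompt.
PRICING_TABLE = {
    "tune-up": 89,
    "service call": 79,
    "thermostat install": 200,
}

# Hypothetical trigger phrases; a real agent would rely on the model, not keywords.
PRICING_KEYWORDS = ("how much", "price", "cost")


def route(utterance: str) -> str:
    """Receptionist's only pricing rule: hand pricing questions to the pricing agent."""
    text = utterance.lower()
    if any(keyword in text for keyword in PRICING_KEYWORDS):
        return "pricing"
    return "receptionist"


def pricing_agent(service: str) -> str:
    """Pricing agent's only rule: quote from the table, otherwise offer a diagnostic visit."""
    key = service.strip().lower()
    price = PRICING_TABLE.get(key)
    if price is not None:
        return f"A {key} is ${price}."
    return "The price depends on the issue. I can book a diagnostic visit for you."


print(route("How much for a tune-up?"))
print(pricing_agent("Tune-up"))
```

Neither function contains a "do not quote" branch. The receptionist never reaches the table, and the pricing agent never handles intake, so there is nothing left for the two rules to fight over.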

What changed?

Mark became two agents.

A receptionist for the standard intake, and a pricing agent who takes over when the caller is shopping rather than booking.

They share a voice and a name. The caller does not know there are two of them.

Each prompt shrank below the size at which the model loses grip on its own rules.

The grader false positives I had been chasing turned out not to be grader bugs.

They were readability bugs. Shorter, single-job prompts gave the grader something it could parse.

Think twice before splitting your voice agent. 

#1 The decision to split the agent should be made very early in agent development.

Otherwise, you would have to develop two or more new agents from scratch, redesign all test cases from scratch, and retest everything from scratch.


#2 When you choose to split the agent into two or more, you are essentially doubling, tripling, or quadrupling your work, because all agents have to share the same rules for identity, knowledge base, speaking style, and so on.

So if you make changes to one agent, they need to be duplicated across other agents. If you do not keep them identical, the caller hears the seam when the handoff happens.

Every change to one agent is a change to every other agent. Two agents mean two copies of every rule update. Three agents mean three. The duplication tax compounds for as long as the agents exist.


#3 Handoffs add real latency. Each agent transfer takes roughly a second of dead air.

For a single transfer, this is tolerable.

For a multi-hop architecture (receptionist hands to pricing, pricing hands to booking, booking hands back to receptionist for the final close), you have three seconds of latency stacked into a 45-second call.

The caller feels it. Back-and-forth transfers are particularly bad because each one breaks conversational flow.
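The arithmetic is worth writing down, because it makes the cost of each extra hop explicit. A minimal sketch, using the rough one-second-per-transfer figure and the 45-second call from the example above (the function name is hypothetical):

```python
def handoff_overhead(hops: int, seconds_per_hop: float = 1.0, call_seconds: float = 45.0):
    """Return total dead air in seconds and its share of the whole call."""
    dead_air = hops * seconds_per_hop
    return dead_air, dead_air / call_seconds


# Receptionist -> pricing -> booking -> receptionist is three transfers.
dead_air, share = handoff_overhead(hops=3)
print(f"{dead_air:.0f}s of dead air, {share:.1%} of the call")
# prints "3s of dead air, 6.7% of the call"
```

The percentage understates the problem, because the silence arrives in three separate interruptions rather than one.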


#4 Handoffs need bridging logic to sound human.

During the transfer gap, the source agent has to say something to fill the dead air.

The wrong phrasing breaks the disguise. Telling the caller they are being transferred sounds confused when the same voice answers on the other side.

The right phrasing is a short bridging line that acknowledges what the caller said and ends with a verbal placeholder.

Getting this wrong makes the whole architecture sound robotic, no matter how clean the individual prompts are.

When to split the voice agent?

The two triggers that actually justify the cost of a split are these.

#1 Cost. The prompt is well past 3,500 tokens (say, 6,000 tokens) and you are paying the Retell surcharge on every call. At any meaningful call volume, this shows up on your monthly invoice. Splitting brings the token count of each constituent agent below the threshold and removes the surcharge. The cost case is decisive because it is measurable. You can put a number on it.
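Putting a number on it takes one line of arithmetic. The sketch below assumes the rough "cost doubles per call" figure quoted earlier. The call volume and base cost per call are made-up inputs for illustration, not Retell's actual rates, so substitute the values from your own invoice.

```python
def monthly_surcharge(calls_per_month: int, base_cost_per_call: float,
                      multiplier: float = 2.0) -> float:
    """Extra monthly spend caused by running above the token threshold.

    multiplier=2.0 reflects the article's rough 'cost doubles' estimate.
    """
    return calls_per_month * base_cost_per_call * (multiplier - 1.0)


# Hypothetical inputs: 2,000 calls a month at $0.10 per call below the threshold.
extra = monthly_surcharge(calls_per_month=2000, base_cost_per_call=0.10)
print(f"Extra spend per month: ${extra:.2f}")
```

Compare that figure with the duplication and handoff costs above before deciding. A small surcharge may be cheaper than maintaining two agents.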

#2 You cannot find a workaround for the following issues within a single agent, no matter how hard you try:

  1. Two rules give different answers to the same caller's question. 
  2. A rule has stacked exceptions. 
  3. A rule fires during a call it was not written for. 

Most rule conflicts can be solved with refactoring. Stacked exceptions become a router pattern. Contradictory rules merge into one rule with explicit branching. Scope leakage gets fixed by tightening rule conditions. Splitting is the right answer only when you have tried these and the fix is uglier than the split would be.
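Tightening rule conditions can be pictured as giving every rule an explicit scope, so that only the rules for the active job are in play. This is a conceptual sketch, not a Retell feature. The `Rule` class and the call-type labels are hypothetical, and the rule texts are the ones quoted from Mark's original prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str
    applies_to: frozenset  # call types this rule is scoped to


RULES = [
    Rule("Confirm the property address.", frozenset({"intake"})),
    Rule("Do not quote prices; a technician will provide pricing after the assessment.",
         frozenset({"intake"})),
    Rule("Quote the flat price from the pricing table for the service the caller has asked about.",
         frozenset({"pricing"})),
]


def active_rules(call_type: str) -> list:
    """Return only the rule texts scoped to the current call type."""
    return [rule.text for rule in RULES if call_type in rule.applies_to]


for text in active_rules("pricing"):
    print(text)
```

With scopes in place, the two pricing rules can never fire on the same call, and the address rule stays out of pricing-only calls. If you cannot express the scopes cleanly inside one prompt, that is the point at which the split starts to look like the less ugly option.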

If neither trigger has fired, do not split. Tighten the prompt. Document the bleeding as known technical debt. The bleeding appears in the rules before it appears in the call. Read the prompt. But read the invoice and the test suite first.
