I built Mark, a voice receptionist for an HVAC company. He runs on Retell, the platform we use to host the voice agent.
His job sounds simple. Pick up the phone, work out what the caller needs, take their details, and pass the lead to the team.
The trouble was pricing.
Mark handles two kinds of calls. The standard intake walks through name, address, property type, and the work the caller needs.
The pricing inquiry is different. The caller wants a number before they commit.
Here Mark's job is to quote the flat price for standard work (tune-ups, diagnostic fees, service calls, thermostat installs), explain when a quote needs a technician visit instead, and capture details for a follow-up.
I held both jobs in one prompt.
The intake section said, "Do not quote prices; a technician will provide pricing after the assessment."
The pricing section said, "Quote the flat price from the pricing table for the service the caller has asked about."
Two rules. Direct opposites.
In practice, the agent followed whichever rule won out in the moment.
Two callers in a row could ask "how much for a tune-up?" and get different answers.
One would get the $89 price. The next would be told that a technician would call back. I could not predict which.
I tried to write my way out. I added sub-rules, exception clauses, and a guard that said, "Quote prices only when the service appears in the pricing table."
Each addition fixed the test I was staring at and broke a different one.
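For contrast, here is roughly what that guard means when it is written as code instead of prompt prose. This is a minimal sketch, not what ran on Retell: `PRICING_TABLE` and `answer_price_question` are hypothetical names, "duct replacement" is a made-up example of unlisted work, and the only price taken from the story is the $89 tune-up.

```python
# Hypothetical pricing table. Only the $89 tune-up figure comes from the story;
# a real table would also list the other flat-rate services.
PRICING_TABLE = {"tune-up": 89}


def answer_price_question(service: str) -> str:
    """Quote a flat price only when the service appears in the table."""
    key = service.strip().lower()
    if key in PRICING_TABLE:
        return f"A {key} is a flat ${PRICING_TABLE[key]}."
    # Anything not in the table is deferred to a technician visit.
    return "A technician will need to take a look before we can quote that."


print(answer_price_question("Tune-up"))
print(answer_price_question("duct replacement"))  # hypothetical unlisted job
```

In code the rule is a single lookup with one outcome per input. In the prompt, the same sentence sat next to an intake rule that said never to quote, and the model still had to choose between them.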
The graders, the LLM judges I use to score test conversations, started throwing false positives.
The rules had become so hard to read that the graders were flagging violations that had not happened.
That was when I split him.