Most teams I have spoken to aim for around 800 ms end-to-end latency for voice agents.

Here is why I think that number keeps showing up.
Where does 800 ms come from?
Humans usually reply to each other within 200 to 500 ms. Once a reply takes longer than that, listeners start noticing the gap and adjust how they speak.
Voice AI cannot yet match human speed.
But sub-800 ms (from the moment the caller stops speaking to the first audio byte back) is the point where calls still feel smooth.
Several industry guides recommend it as the design target.
Anything between 800 and 1200 ms is acceptable, but callers start to notice and slow down.
So 800 ms is not a rule from theory. It is the practical sweet spot between ideal user experience and what you can actually deliver in production.
How the budget breaks down.
A typical sub-second voice agent budget looks something like this:
- End of speech detection: 150 to 200 ms.
- Speech to text: 100 to 200 ms.
- LLM first token: 200 to 500 ms.
- Text to speech first byte: 80 to 200 ms.
- Network overhead: 20 to 100 ms.
If every component stays within its range, the total lands somewhere between 550 and 1200 ms. Hitting 800 ms means most stages have to run near the fast end of their range.
That is why most engineering guides treat sub-800 ms as the primary KPI.
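To make the arithmetic concrete, here is a minimal sketch that adds up the per-stage ranges listed above. It assumes the stages run strictly one after another, which is a simplification: streaming pipelines often overlap them. The names BUDGET_MS, budget_range and overshoot are illustrative, not taken from any framework.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
BUDGET_MS = {
    "end_of_speech_detection": (150, 200),
    "speech_to_text": (100, 200),
    "llm_first_token": (200, 500),
    "tts_first_byte": (80, 200),
    "network_overhead": (20, 100),
}


def budget_range(budget):
    """Return (best_case, worst_case) totals, assuming stages run sequentially."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def overshoot(budget, target_ms=800):
    """Return how far the worst-case total exceeds the target (negative means spare time)."""
    _, worst = budget_range(budget)
    return worst - target_ms


best, worst = budget_range(BUDGET_MS)
print(f"best case: {best} ms, worst case: {worst} ms")
print(f"worst case exceeds an 800 ms target by {overshoot(BUDGET_MS)} ms")
```

The LLM first token has the widest range of any stage, so it is usually the first place to look when the total drifts past the target.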
What I notice at different speeds.
Under 500 ms feels almost human. Magical, but hard to hit consistently.
500 to 800 ms is the sweet spot. Callers keep a natural rhythm. Few "hello, are you there" moments.
800 to 1200 ms still works for business calls. But callers start pausing longer between turns. It feels more like a slightly laggy IVR than a real conversation.
Above 1200 ms is where things break down. More talk-overs. More repeated sentences. Callers feel something is wrong even when the agent answers correctly.
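The bands above can be turned into a small helper for tagging measured turns in logs or dashboards. This is a minimal sketch that assumes the band edges sit at 500, 800 and 1200 ms. The function name and the labels are illustrative.

```python
def latency_band(latency_ms):
    """Map a response latency in milliseconds to a rough experience band.

    Band edges (500, 800, 1200 ms) are an assumption based on the
    descriptions above, not a measured standard.
    """
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 500:
        return "almost human"
    if latency_ms < 800:
        return "sweet spot"
    if latency_ms <= 1200:
        return "acceptable"
    return "breaks down"


for sample in (420, 760, 1050, 1600):
    print(sample, "ms ->", latency_band(sample))
```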
How I would set targets.
I would not promise 800 ms 100% of the time. I would treat it as a design centre and wrap SLAs around it:
- p50 under 800 to 1000 ms. p50 of 800 ms means 50% of calls respond in under 800 ms. The other 50% are slower. It tells you the typical experience. But it hides the bad calls.
- p95 under 1800 to 2000 ms. p95 of 1800 ms means 95 out of 100 calls respond in under 1800 ms. Five out of 100 are slower. This tells you what your worst calls look like. The ones where callers actually feel the lag.
- Hard max under 2500 to 3000 ms. A hard max of 2500 ms means no call should ever take longer than 2.5 seconds. If it does, you investigate.
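Here is a minimal sketch of how these three targets could be checked against measured latencies. It assumes you already have one latency figure per response in milliseconds, and it uses the nearest-rank definition of a percentile. Other definitions interpolate and give slightly different values. The function names, default thresholds and sample data are all illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(samples, p50_ms=800, p95_ms=1800, hard_max_ms=2500):
    """Return {name: (observed_ms, within_target)} for p50, p95 and the hard max."""
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    worst = max(samples)
    return {
        "p50": (p50, p50 < p50_ms),
        "p95": (p95, p95 < p95_ms),
        "hard_max": (worst, worst < hard_max_ms),
    }


# Made-up latencies in ms: nine reasonable responses and one very slow one.
latencies = [620, 700, 710, 750, 780, 790, 820, 900, 1100, 2600]
for name, (observed, ok) in check_targets(latencies).items():
    print(f"{name}: {observed} ms, {'ok' if ok else 'over target'}")
```

With this sample the p50 passes while the p95 and the hard max both fail, which is the case the p50 on its own would hide.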
Some vendors push harder. p50 under 500 ms, p95 under 800 ms. But 800 ms is still the line where calls stop feeling good.