
I stripped the formatting from my voice agent prompt to save tokens. Here is what it cost me.

When I started building voice agents, I treated the prompt like a document for myself. Headings, bullets, bold text, blank lines between sections. I thought the formatting was for me, not for the model. I was wrong.

I ran a test. Two voice agents. Identical instructions. Identical model. Identical voice. 

Both prompts told the agent the same things. Same identity. Same tone rules. Same rhythm rules. Same confirmation rules. Same scripted lines. Same flow order. Same key rules. Same exception handling.


The well-formatted version used markdown conventions.

Section headers with ## markers. Bold names for rules using **Tone:** syntax. Bulleted lists with dashes. Numbered flow steps. Quotation marks around scripted lines. Em-dashes to separate clauses. Blank lines between sections.

The stripped version had none of those markers.

Every header is gone. Every bullet is gone. Every quotation mark around scripted lines is gone. Most punctuation is gone. All the rules run together as continuous prose.
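To make the difference concrete, here is a minimal sketch of the kind of stripping described above. The prompt excerpt and the `strip_formatting` helper are invented for illustration; they are not the actual prompt or tooling from the test, and the set of punctuation removed is an assumption.

```python
import re

# Hypothetical prompt excerpt, invented for illustration only.
FORMATTED = """## Tone rules
- **Tone:** warm and calm
- **Rhythm:** short sentences

## Flow
1. Greet the caller
2. Say: "Thanks for calling, how can I help?"
"""


def strip_formatting(prompt: str) -> str:
    """Collapse a markdown-style prompt into continuous, marker-free prose."""
    # Remove section header markers such as "## ".
    text = re.sub(r"^#+\s*", "", prompt, flags=re.MULTILINE)
    # Remove bullet dashes and numbered-step prefixes at the start of a line.
    text = re.sub(r"^\s*(?:[-•]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Remove bold markers.
    text = text.replace("**", "")
    # Remove quotation marks and most punctuation (assumed set).
    text = re.sub(r'["“”:,.?!]', "", text)
    # Remove blank lines and line breaks so everything runs together.
    return " ".join(text.split())


stripped = strip_formatting(FORMATTED)
print(stripped)
```

In the output, the header, the rule names, the flow steps and the scripted line all run together, and nothing marks which words are meant to be spoken verbatim.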

The well-formatted agent's prompt came to between 1418 and 3558 tokens. The unformatted version came to between 1223 and 3363 tokens.

The absolute drop is exactly 195 tokens at both ends of the range. That's the consistent saving from removing the formatting.


Calculating the percentage at each end:

At the lower end (1418 → 1223): 195 / 1418 = 13.8% drop

At the upper end (3558 → 3363): 195 / 3558 = 5.5% drop

So, depending on where in the range your actual prompt sits, the saving is somewhere between 5.5% and 13.8%.
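The arithmetic can be checked with a few lines of Python. The token counts are the ones reported above; the `saving` helper and the variable names are just illustrative.

```python
# Token ranges reported above (low end, high end).
formatted_range = (1418, 3558)
stripped_range = (1223, 3363)


def saving(before: int, after: int) -> tuple[int, float]:
    """Return the absolute token saving and the saving as a percentage of `before`."""
    saved = before - after
    return saved, round(100 * saved / before, 1)


low_end = saving(formatted_range[0], stripped_range[0])
high_end = saving(formatted_range[1], stripped_range[1])

print(low_end)   # (195, 13.8)
print(high_end)  # (195, 5.5)
```

Because the absolute saving is fixed at 195 tokens, the percentage shrinks as the prompt grows.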


On production-scale prompts, which sit at the upper end of that range, the token count dropped by about 5 to 6 percent. That's a real saving, but the behaviour cost was higher than the saving was worth.

What I expected to happen.

Honestly, I expected the difference to be subtle. The agent had the same instructions either way.

I assumed the model would parse the prose version and apply the rules just as it did with the markdown version. Maybe a few percent slower or less reliable, but nothing dramatic.

This expectation was based on a vague intuition that modern models are robust. They handle a lot of input variation. Surely a bit of formatting compression couldn't matter much.

What actually happened.

The first agent handled stacked input correctly. Acknowledged each piece. Deflected the diagnosis. Moved through the flow cleanly.

The second agent dropped pieces. Engaged the diagnosis question instead of deflecting. Got the order of acknowledgements wrong. Sounded rushed and slightly confused.

Same instructions. Different formatting. Audibly different calls.

Markdown markers do three things that help the model.

#1 They explicitly mark structural boundaries. 

  • A section header tells the model "what follows is a new topic." 
  • A bullet point tells the model "this is one item in a list of items." 
  • A numbered step tells the model "this is part of a sequence." 
  • Without these markers, the model has to infer structure from word patterns alone, which is harder when the content is rule-heavy.

#2 They mark what's content versus what's instruction. 

  • Quotation marks around a line tell the model "say this verbatim." 
  • Bold around a rule name tells the model "this is the name of a behaviour pattern." 
  • Italics often mark emphasis or variables. 
  • Without these markers, every piece of text is just words competing for attention. 
  • The model has to work harder to figure out which words are instructions to follow and which are examples to imitate.

#3 They match how the model was trained. 

  • Modern language models have seen vast amounts of markdown. Documentation. READMEs. Technical writing. Blog posts. The model has internalised what markdown means and how to navigate it efficiently.
  • Continuous prose with rule-heavy content is rarer in the training data, and the model handles it less reliably.

Stripping formatting removes the markers that help the model parse the content. The tokens you save are small. The cost in reliability is higher.

I had assumed formatting was just visual decoration that helped me read the prompt but didn't matter to the model. That's wrong. The model parses formatting too. The markers are doing work.

This isn't an argument against plain English in voice agent prompts. 

Short conversational prompts work fine in plain prose. Personality descriptions work fine in plain prose. Narrative framing works fine in plain prose.

The argument is specifically about rule-heavy prompts. 


When the prompt contains lots of conditional behaviours, scripted lines that must be said verbatim, hierarchical decision logic, exception handling, and ordered flow steps, that kind of content benefits from explicit structure. Stripping the structure makes the rules harder to retrieve and apply reliably.
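One way to keep that structure explicit is to hold the rules as data and render them with markers. This is only a sketch: `render_prompt`, the section names and the sample content are all hypothetical, not the prompt used in the test.

```python
def render_prompt(rules: dict[str, str], flow: list[str], scripted: list[str]) -> str:
    """Render rule-heavy prompt content with explicit markdown structure."""
    lines = ["## Key rules", ""]
    # Bold rule names, one bullet per rule.
    lines += [f"- **{name}:** {rule}" for name, rule in rules.items()]
    lines += ["", "## Flow", ""]
    # Numbered steps make the order explicit.
    lines += [f"{i}. {step}" for i, step in enumerate(flow, start=1)]
    lines += ["", "## Scripted lines", ""]
    # Quotation marks flag text to be said verbatim.
    lines += [f'- "{line}"' for line in scripted]
    return "\n".join(lines)


# Hypothetical content, invented for illustration.
prompt = render_prompt(
    rules={"Tone": "warm and calm", "Confirmation": "repeat key details back"},
    flow=["Greet the caller", "Collect the details", "Confirm and close"],
    scripted=["Thanks for calling, how can I help?"],
)
print(prompt)
```

Each rule, step and scripted line keeps its own marker, so the boundaries never depend on the wording alone.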


Plain English with paragraph breaks, topic sentences, and clear organisation can work too. 

The two agents I tested sat at opposite extremes. Markdown on one side, wall-of-text prose on the other.

Most real prompts sit somewhere between those extremes, but the principle is still the same: structure helps the model.