500 milliseconds. That's roughly the time it takes to blink twice.
It's also the threshold that separates a voice AI conversation that feels natural from one that feels broken.
When a caller finishes speaking and there's a pause before the agent responds, something happens in their brain. Under 500ms, the pause feels like normal conversational timing. Over 500ms, it feels like a lag. Over a second, callers start wondering if the call dropped. Over two seconds, they're already asking for a human.
Latency in voice AI isn't a technical metric. It's the primary driver of whether a caller trusts the agent they're talking to. And the LLM is where the most latency lives, and where most of the opportunity to reduce it exists.
This guide explains what LLM latency actually is, where it comes from, and the specific techniques that get production voice AI systems under 500ms.
What Is LLM Latency in Voice AI?
In a voice AI agent, LLM latency is the time between when the transcribed text of the caller's speech arrives at the language model and when the model produces its first output token, the first word of the response.
This is sometimes called Time to First Token (TTFT), and it's the number that matters most for real-time conversation. Callers don't experience the total generation time, they experience how quickly the agent starts responding. A response that begins in 300ms and streams naturally feels fast, even if the complete response takes another 400ms to finish generating.
LLM latency is one piece of the total response latency stack:
Total response latency = STT latency + LLM latency + TTS latency
In a typical deployment:
STT (streaming): 100–200ms
LLM (TTFT): 200–800ms
TTS (first audio): 80–200ms
At the high end, that's 1,200ms before the caller hears anything. At the low end, it's 380ms. The gap between those two experiences is the difference between a voice AI agent that works and one that doesn't.
LLM latency is both the largest variable and the most controllable. That's where optimisation effort pays off most.
Where LLM Latency Comes From
Before getting into how to reduce latency, it's worth understanding exactly where it comes from. There are three sources.

Network Round-Trip Time
The transcribed text has to travel from your infrastructure to the LLM provider's servers and the response has to travel back. If your voice AI infrastructure is in Singapore and the LLM endpoint is in the US, you're adding 150–200ms of network latency before the model has even started thinking.
This is pure physics. Light travels at a finite speed. Long-haul network routes add latency that no amount of model optimisation can eliminate. Geographic proximity between your infrastructure and the LLM endpoint is a real performance factor.
Model Inference Time
This is the time the model actually takes to process the input and generate the first output token. It's determined by the model's architecture (size, number of parameters, attention mechanism), the hardware it runs on (GPU type, memory bandwidth), and the current load on the inference cluster.
Larger models take longer than smaller ones. A 70B parameter model on shared infrastructure takes longer than a 7B parameter model on dedicated hardware. The relationship isn't linear, but the direction is consistent.
Context Window Size
The LLM processes the entire context, the system prompt, the conversation history, and the current user message, before generating each response. Longer contexts take longer to process.
A system prompt that's 500 tokens processes faster than one that's 3,000 tokens. A conversation that's been running for 20 turns has more history to process than one that's 3 turns in. Context window size is something you have direct control over, and it's one of the most commonly overlooked latency levers.
The 500ms Target: Why It Matters
500ms isn't an arbitrary number. It comes from research on human conversational timing and from production data on voice AI call outcomes.
In natural human conversation, the typical gap between one speaker finishing and the other responding is 200–300ms. Responses under 500ms feel normal. Responses between 500ms and 1 second feel slightly hesitant. Responses over 1 second feel like a technical problem.
| Response Latency | Caller Perception | Effect on Outcomes |
|---|---|---|
| Under 300ms | Instant, natural | Highest satisfaction, lowest transfer rate |
| 300–500ms | Normal, comfortable | Good satisfaction, acceptable transfer rate |
| 500ms–1s | Noticeable pause | Satisfaction drops, transfers increase |
| 1–2s | Feels broken | Significant abandonment |
| Over 2s | Call appears dropped | High abandonment, poor CSAT |
The goal isn't to hit exactly 500ms, it's to stay under it consistently across the p95 of calls, not just on average. An average of 400ms with a p95 of 1.2 seconds still produces a poor experience for a significant percentage of callers.
How to Reduce LLM Latency Below 500ms
There are seven specific techniques that production voice AI teams use to get LLM latency under control. The best deployments use several of them together.

Technique 1: Model Selection and Routing
The single highest-leverage latency reduction is choosing the right model for each query. GPT-4o is a remarkable model. It's also slower and more expensive than necessary for a large percentage of voice AI queries.
"What are your business hours?" doesn't need GPT-4o. A smaller, faster model handles it just as well, faster, and at a fraction of the cost.
Multi-LLM routing means classifying incoming queries by complexity and routing simple queries to faster, cheaper models (GPT-4o Mini, Claude Haiku, Gemini Flash) while reserving the full-size models for queries that genuinely require their reasoning capability.
In practice, 60–70% of voice AI queries are simple enough to be handled by smaller models without any drop in quality that callers would notice. The latency improvement for those queries can be 200–400ms. The cost improvement is 70–80%.
Technique 2: System Prompt Optimisation
System prompts are processed on every single LLM call. A 3,000-token system prompt adds meaningful processing time to every response. A 600-token prompt that achieves the same behavioural results is significantly faster.
Common system prompt issues that add unnecessary tokens: repeating the same instruction in multiple phrasings, including extensive example dialogues that could be in a knowledge base instead, and over-specifying edge cases that rarely occur in production.
Technique 3: Conversation History Management
Every turn of the conversation adds to the context the LLM processes. The solution is intelligent context management, summarising older turns rather than keeping them verbatim, retaining only the turns most relevant to the current query, and extracting key variables from the conversation (name, account number, stated intent) and storing them explicitly.
Well-implemented context management can reduce the tokens processed per call by 40–60% in long conversations, with no loss of conversational coherence.
Technique 4: Streaming Token Output
Streaming LLM output means the model sends tokens to the TTS engine as they're generated rather than waiting for the complete response. The TTS engine can start generating audio from the first few words while the LLM is still generating the rest. The caller starts hearing the response significantly before the LLM has finished producing it.
All major LLM providers support streaming. If a voice AI platform isn't using it, that's a significant architectural oversight.
Technique 5: Infrastructure Geography
Deploying your voice AI infrastructure in the same region as your LLM provider endpoints reduces network round-trip time meaningfully. Major LLM providers operate endpoints in multiple regions. Choosing the endpoint closest to your voice infrastructure is a straightforward optimisation that can save 50–150ms depending on current deployment geography.
Technique 6: Caching Common Responses
Some responses in a voice AI deployment are highly predictable. The answer to "what are your business hours" is always the same. The opening greeting is always the same.
Caching these responses, returning pre-generated audio rather than going through the full STT-LLM-TTS pipeline, eliminates latency entirely for those interactions. They return in milliseconds. Even caching 15–20% of interactions at near-zero latency significantly improves overall perceived responsiveness.
Technique 7: Speculative Prefill
If the caller has said "I need to reschedule my" and is still speaking, the system can already begin prefilling the context with appointment scheduling information, because the probability is high that a scheduling request is coming. When the full utterance arrives, the model has already done part of the work.
This technique can shave 100–200ms off response time for predictable query patterns when implemented carefully.
Measuring Latency in Production
Knowing your average latency isn't enough. You need to understand your latency distribution.
| Metric | What to Measure | Target |
|---|---|---|
| p50 latency | Median response time | Under 350ms |
| p95 latency | 95th percentile | Under 600ms |
| p99 latency | Worst 1% of responses | Under 1,000ms |
| Latency by query type | Simple vs complex | Separate targets for each |
| Latency under load | At 50, 100, 200+ concurrent calls | Max 20% degradation |
| Latency by time of day | Peak vs off-peak | Consistent across hours |
| ``` |
The p95 number is the one that matters most. Your median could be excellent while your p95 is terrible, and the callers in that 5% are having a poor experience that damages your metrics.
Multi-LLM Routing: The Highest-ROI Optimisation
Multi-LLM routing deserves its own section because it addresses latency, cost, and quality simultaneously.
| Query Type | Example | Recommended Model | Typical Latency |
|---|---|---|---|
| Simple FAQ | "What are your hours?" | GPT-4o Mini / Gemini Flash | 80–150ms |
| Standard transaction | "Book an appointment for Tuesday" | GPT-4o Mini / Claude Haiku | 120–200ms |
| Moderate complexity | "I need to dispute a charge" | GPT-4o / Claude Sonnet | 200–350ms |
| High complexity | Multi-step complaint with policy lookup | GPT-4o / Claude Sonnet | 300–500ms |
| Edge cases | Ambiguous intent, deep reasoning | GPT-4o / Claude Opus | 400–700ms |
| ``` |
With well-implemented routing, 60–70% of calls fall into the first two categories. Those calls see latency under 200ms, cost reductions of 70–80%, and no quality degradation.
What Sub-500ms Looks Like in Practice
A well-optimised production deployment achieving sub-500ms end-to-end response time:
| Component | Optimised Timing |
|---|---|
| STT (streaming) | 80–120ms |
| LLM routing classifier | 10–20ms |
| LLM inference (simple query) | 100–180ms |
| TTFT TTS (first audio) | 80–120ms |
| Total | 270–440ms |
For complex queries using a larger model:
| Component | Optimised Timing |
|---|---|
| STT (streaming) | 80–120ms |
| LLM routing classifier | 10–20ms |
| LLM inference (complex query) | 280–380ms |
| TTFT TTS (first audio) | 80–120ms |
| Total | 450–640ms |
Final Thought
Latency is not a backend metric. It's a caller experience metric.
Every 100ms you shave off the response time is 100ms less friction in every conversation your agent handles. Multiplied across hundreds or thousands of calls per day, small latency improvements compound into meaningfully better call outcomes.
The path to sub-500ms is not a single fix. It's a combination of model selection, prompt efficiency, context management, streaming output, and infrastructure geography, implemented together and measured rigorously in production.
Start with multi-LLM routing. It delivers the largest combined improvement on latency, cost, and quality of any single technique. Then work through the remaining optimisations in order of impact.
Your callers will notice the difference before you explain to them what changed.
Want to see how VoiceInfra handles multi-LLM routing and latency optimisation in production? Schedule a demo and we'll show you real latency numbers from live deployments.
Related reading:
7 Core Components of a Voice AI Agent Explained
How Speech-to-Text (STT) Works in Voice AI Agents
Text-to-Speech (TTS) for Voice AI: Why Voice Quality Matters



