Callers don't read transcripts. They listen.
And within the first five seconds of a call, they've already made a judgement about the voice they're hearing. Is it natural? Does it sound like a real person? Or does it sound like a robot reading from a script?
That judgement sticks. It shapes how much patience they have for the rest of the conversation. It determines whether they stay on the line or ask for a human immediately. It affects whether they trust what the agent tells them.
Text-to-speech is the last thing a voice AI agent does before the caller hears anything. It's also one of the most underinvested components in most deployments. Teams spend weeks optimising their LLM prompts and days evaluating STT providers, then pick the default TTS engine because it ships with the platform.
That's a mistake. This guide explains why voice quality matters more than most people think, how TTS technology works, what separates good from bad, and how to evaluate TTS engines for production voice AI.
What Is Text-to-Speech in Voice AI?
Text-to-speech (TTS) is the component that converts the LLM's text response into audio that the caller hears. It's the voice of the AI agent.
In older systems, TTS worked by stitching together pre-recorded phoneme segments, which is why traditional IVR voices sounded robotic and mechanical. The speech was technically intelligible but clearly synthetic.
Modern TTS engines are entirely different. They use neural networks trained on thousands of hours of human speech to generate audio that mimics the natural patterns of human conversation, including pacing, intonation, breath patterns, and emotional tone.
The gap between 2018 TTS and 2026 TTS is enormous. The best current engines produce voices that are genuinely difficult to distinguish from a real person at conversational speed. Many callers who speak with AI-powered agents today don't realise they're talking to a machine.
That's not magic. It's the result of specific technical advances, and understanding them helps you make better decisions about which engine to use.
How Modern TTS Actually Works

Step 1: Text Normalisation
Before the neural network sees anything, the input text goes through normalisation. This converts numbers, abbreviations, dates, and special characters into their spoken equivalents.
"Your appointment is on 14/03 at 3pm" becomes "Your appointment is on the fourteenth of March at three PM." "Your account balance is $2,847.50" becomes "Your account balance is two thousand eight hundred and forty-seven dollars and fifty cents."
This step matters more than it sounds. A TTS engine that stumbles over "3pm", or mispronounces a medication name because it doesn't recognise the abbreviation, will erode caller trust quickly. Domain-specific normalisation, such as knowing that "mg" means "milligrams" in a healthcare context, significantly improves naturalness on specialised content.
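As a rough illustration of what a normalisation pass does, here is a minimal Python sketch. It is not how any particular engine implements this step: the function names, the `UNITS` lexicon, and the regular expressions are all assumptions made for the example. It only handles dollar amounts, whole-hour times, and a few unit abbreviations. Dates such as "14/03" are left out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical healthcare lexicon (singular forms); a real deployment
# would maintain its own domain vocabulary.
UNITS = {"mg": "milligram", "ml": "millilitre", "kg": "kilogram"}


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 in British-style English."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only handles 0..999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
        return words + (" and " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest == 0:
        return words
    joiner = " and " if rest < 100 else " "
    return words + joiner + number_to_words(rest)


def _plural(count: int, noun: str) -> str:
    return noun if count == 1 else noun + "s"


def _currency(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2) or 0)
    spoken = f"{number_to_words(dollars)} {_plural(dollars, 'dollar')}"
    if cents:
        spoken += f" and {number_to_words(cents)} {_plural(cents, 'cent')}"
    return spoken


def _time(match: re.Match) -> str:
    # Whole hours only, e.g. "3pm"; "3:30pm" is outside this sketch.
    return f"{number_to_words(int(match.group(1)))} {match.group(2).upper()}"


def _unit(match: re.Match) -> str:
    amount = int(match.group(1))
    return f"{number_to_words(amount)} {_plural(amount, UNITS[match.group(2)])}"


def normalise(text: str) -> str:
    """Expand currency, whole-hour times, and known units into spoken form."""
    text = re.sub(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?", _currency, text)
    text = re.sub(r"\b(\d{1,2})\s?(am|pm)\b", _time, text, flags=re.IGNORECASE)
    unit_pattern = r"\b(\d+)\s?(" + "|".join(UNITS) + r")\b"
    return re.sub(unit_pattern, _unit, text)


print(normalise("Your account balance is $2,847.50"))
print(normalise("Take 500mg at 3pm"))
```

Rule-based passes like this are easy to extend with your own vocabulary, which is why it is worth testing an engine's built-in normalisation against your actual scripts before deciding whether you need a pre-processing layer of your own.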
Step 2: Linguistic Analysis
The normalised text is analysed for its linguistic structure. This includes part-of-speech tagging (is this word a noun or a verb?), prosodic phrasing (where should natural pauses fall?), and emphasis detection (which words carry semantic stress?).
This is where the engine decides which word to stress. "I didn't say she stole the money" has seven different meanings depending on which of its seven words is stressed, and the engine has to pick the one that fits the context.
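To make the stress example concrete, the short Python sketch below enumerates every single-word stress placement for a sentence. The asterisk marker and the function name `stress_variants` are purely illustrative assumptions. A real engine chooses one placement from context rather than listing them all.

```python
def stress_variants(sentence: str) -> list[str]:
    """Return one variant per word, with the stressed word wrapped in asterisks."""
    words = sentence.split()
    variants = []
    for target in range(len(words)):
        marked = [f"*{w}*" if i == target else w for i, w in enumerate(words)]
        variants.append(" ".join(marked))
    return variants


for variant in stress_variants("I didn't say she stole the money"):
    print(variant)
```

Reading the seven printed lines aloud, stressing the marked word each time, shows how much meaning the engine has to infer that is not present in the text itself.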
Step 3: Acoustic Synthesis
The linguistic representation is passed to a synthesis stage that generates the actual audio waveform. Modern systems typically use one of two approaches.
Concatenative synthesis selects and stitches together fragments of recorded speech. It is higher quality than old phoneme-stitching but still produces occasional audible seams between fragments.
Neural synthesis (used by ElevenLabs, OpenAI TTS, Cartesia, and others) generates audio from scratch using a model trained on human speech. No pre-recorded fragments, no stitching artifacts. The model learns the statistical patterns of natural speech and generates new audio that follows those patterns. This is why the best modern voices sound genuinely natural rather than assembled.
Step 4: Streaming Delivery
The generated audio is streamed back to the caller as it's produced, not delivered as a complete file once generation is finished. This is the TTS equivalent of streaming STT: it reduces the perceived latency of the response.
For a 10-word response, streaming TTS can begin playing the first few words to the caller while the remaining words are still being generated. The caller hears a response that starts quickly rather than waiting for the full audio file to be ready.
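The difference between the two delivery modes comes down to simple arithmetic, sketched below in Python. This is a simplified model under stated assumptions: audio is generated in sequential chunks, each with a known generation time, and network transfer and playback time are ignored. The `Timeline` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Timeline:
    first_audio_ms: int   # when the caller hears the first sound
    complete_ms: int      # when the last chunk has been generated


def batch_delivery(chunk_generation_ms: list[int]) -> Timeline:
    """Wait for every chunk, then play: first audio arrives only at the end."""
    if not chunk_generation_ms:
        raise ValueError("need at least one chunk")
    total = sum(chunk_generation_ms)
    return Timeline(first_audio_ms=total, complete_ms=total)


def streaming_delivery(chunk_generation_ms: list[int]) -> Timeline:
    """Play each chunk as soon as it exists: first audio arrives after chunk one."""
    if not chunk_generation_ms:
        raise ValueError("need at least one chunk")
    return Timeline(first_audio_ms=chunk_generation_ms[0],
                    complete_ms=sum(chunk_generation_ms))


# Illustrative figures only: four chunks taking 150 ms each to generate.
chunks = [150, 150, 150, 150]
print(batch_delivery(chunks))
print(streaming_delivery(chunks))
```

With these made-up figures, both modes finish generating at 600 ms, but the streaming caller hears audio after 150 ms while the batch caller waits the full 600 ms.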
Why Voice Quality Has a Direct Business Impact
This isn't an aesthetic preference. Voice quality has measurable effects on call outcomes.

Abandonment rate. Callers who perceive the voice as robotic or unnatural request human transfer faster and abandon calls sooner. The effect is most pronounced in the first 30 seconds. If the opening greeting sounds mechanical, a significant percentage of callers will immediately ask for a human, regardless of how capable the underlying AI is.
Task completion rate. Callers are more patient with a natural-sounding agent. They're more willing to stay on the line through a multi-step workflow, provide information when asked, and follow instructions. A robotic voice creates friction that compounds across a long conversation.
Caller satisfaction (CSAT). Post-call surveys consistently show that voice quality is one of the top factors callers mention when rating an AI interaction. The LLM can handle the query perfectly, the integration can execute flawlessly, and the caller will still rate the experience poorly if the voice felt unnatural.
Brand perception. For businesses where the phone channel is a significant customer touchpoint, the voice of the AI agent is, in effect, the voice of the brand. A robotic-sounding agent communicates cheapness and low investment. A natural-sounding agent communicates quality.
The data from our own deployments is consistent with the broader industry pattern. Switching from a standard TTS engine to a premium neural voice typically produces a 15 to 25% reduction in early transfer requests, without any change to the underlying workflow or LLM.
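If you want to check a figure like this against your own call logs, the calculation is a relative reduction in the early transfer rate. The Python sketch below shows the arithmetic. The call counts are invented for illustration and the function names are assumptions, not part of any analytics product.

```python
def early_transfer_rate(early_transfers: int, total_calls: int) -> float:
    """Share of calls where the caller asked for a human early in the call."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return early_transfers / total_calls


def relative_reduction(before_rate: float, after_rate: float) -> float:
    """Fractional drop from the old rate to the new one (0.2 means 20% lower)."""
    if before_rate <= 0:
        raise ValueError("before_rate must be positive")
    return (before_rate - after_rate) / before_rate


# Hypothetical numbers: 1,000 calls on each engine, same workflow and LLM.
before = early_transfer_rate(early_transfers=300, total_calls=1000)
after = early_transfer_rate(early_transfers=240, total_calls=1000)
print(f"Relative reduction: {relative_reduction(before, after):.0%}")
```

Keeping the workflow and LLM fixed between the two measurement periods matters here, otherwise the change cannot be attributed to the voice.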
What Makes a TTS Voice Sound Human
Most people can identify a synthetic voice but struggle to explain exactly what gives it away. Here are the specific characteristics that separate natural-sounding TTS from robotic-sounding TTS.
Prosodic variation. Human speech has natural variation in pitch, pace, and volume across a sentence and across a conversation. Robotic TTS tends to be monotone or uses simple rule-based pitch variation that sounds predictable and mechanical. Neural TTS models learn the statistical distribution of prosodic variation from human speech and apply it naturally.
Sentence-final intonation. Questions rise at the end. Statements fall. Incomplete sentences have a different pattern than complete ones. Getting this right requires genuine linguistic understanding of the sentence being spoken, not just a rule that says "add rising intonation if the sentence ends with a question mark."
Natural pause placement. Humans pause at clause boundaries, before important information, and occasionally mid-sentence when thinking. A voice that never pauses sounds unnatural. A voice that pauses only at punctuation marks sounds rigid. Neural models learn pause patterns from human speech and apply them in ways that feel organic.
Filler and bridging language. This is a content-level concern rather than a TTS concern, but it interacts with TTS quality. When an agent says "Give me just a moment to check that for you" in a natural voice, it sounds like a real person. When the same phrase is delivered in a robotic voice, the dissonance between the natural language and the synthetic delivery is jarring.
Breathing and micropauses. The best neural TTS engines include subtle breath sounds and very short pauses (under 50ms) between phrases, which are inaudible as individual events but contribute to the overall impression of naturalness. Their absence is one of the things that makes synthetic speech feel slightly "off" even when everything else sounds good.
TTS Provider Comparison
The major TTS providers differ significantly on voice quality, latency, language support, customisation options, and cost.

| Provider | Voice Quality | Streaming | Latency | Languages | Best For |
|---|---|---|---|---|---|
| ElevenLabs | ⭐⭐⭐⭐⭐ Best in class | ✅ Yes | ~180ms | 30+ | Highest naturalness, premium deployments |
| OpenAI TTS | ⭐⭐⭐⭐ Excellent | ✅ Yes | ~220ms | 50+ | Quality + broad language support |
| Cartesia | ⭐⭐⭐⭐ Excellent | ✅ Yes | ~90ms | 15+ | Lowest latency, real-time applications |
| PlayHT | ⭐⭐⭐⭐ Very good | ✅ Yes | ~200ms | 40+ | Voice cloning, custom voices |
| Google TTS | ⭐⭐⭐ Good | ✅ Yes | ~160ms | 60+ | Multilingual, Google Cloud stack |
| Azure TTS | ⭐⭐⭐ Good | ✅ Yes | ~180ms | 75+ | Enterprise compliance, widest language range |
A few things worth noting. ElevenLabs consistently produces the most natural-sounding output in head-to-head comparisons, particularly on emotional range and prosodic variation. Cartesia's latency advantage, under 100ms to first audio, is significant for real-time voice AI where every millisecond of response time matters. OpenAI TTS strikes a strong balance across quality, latency, and language support.
For enterprise deployments with compliance requirements (healthcare, financial services), Azure and Google have certifications and data residency options that ElevenLabs and Cartesia currently lack.
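One way to turn the comparison table into a shortlist is to filter it by your hard requirements. The Python sketch below copies the approximate latency and language figures from the table above and treats the language counts as lower bounds. The `Provider` type and `shortlist` function are hypothetical, and the figures should be re-measured for your own region and content before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    first_audio_ms: int   # approximate latency from the table
    languages: int        # lower bound, e.g. "30+" is stored as 30


# Figures copied from the comparison table; treat them as rough estimates.
PROVIDERS = [
    Provider("ElevenLabs", 180, 30),
    Provider("OpenAI TTS", 220, 50),
    Provider("Cartesia", 90, 15),
    Provider("PlayHT", 200, 40),
    Provider("Google TTS", 160, 60),
    Provider("Azure TTS", 180, 75),
]


def shortlist(max_first_audio_ms: int, min_languages: int) -> list[str]:
    """Names of providers meeting both limits, fastest first."""
    matches = [p for p in PROVIDERS
               if p.first_audio_ms <= max_first_audio_ms
               and p.languages >= min_languages]
    return [p.name for p in sorted(matches, key=lambda p: p.first_audio_ms)]


print(shortlist(max_first_audio_ms=200, min_languages=50))
print(shortlist(max_first_audio_ms=100, min_languages=1))
```

A filter like this only narrows the field on measurable criteria. Voice quality still has to be judged by listening.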
Latency: The Other Half of TTS Quality
Voice quality is one dimension of TTS evaluation. Latency is the other, and they pull in opposite directions.
The highest-quality neural TTS engines generate audio by running complex models that take time. The fastest engines make architectural tradeoffs that reduce quality to achieve lower latency. Choosing a TTS engine always involves a quality-latency tradeoff.
| Latency to First Audio | Caller Experience |
|---|---|
| Under 100ms | Imperceptible, response feels instant |
| 100–200ms | Natural, barely noticeable |
| 200–400ms | Slight pause, still acceptable |
| 400–700ms | Noticeable pause, slightly unnatural |
| Over 700ms | Conversation feels laggy, trust drops |
The key metric is time to first audio, not time to complete audio. Because modern TTS engines stream audio as they generate it, the caller starts hearing the response before the full audio is produced. A response that starts in 150ms feels fast even if the complete audio takes 600ms to generate.
This is why streaming TTS is essential for production voice AI, for the same reason streaming STT is essential. Sequential processing (wait for full audio, then play) adds unnecessary latency when parallel streaming is available.
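To measure time to first audio yourself, you can time how long a streaming response takes to yield its first chunk and map the result onto the bands in the table above. The Python sketch below assumes your TTS client exposes audio as an iterable of byte chunks that starts the request lazily. That interface, the function names, and the treatment of exact boundary values (200 ms counts as "200–400ms", 700 ms as "400–700ms") are assumptions for this example.

```python
import time
from typing import Iterable


def caller_experience(first_audio_ms: float) -> str:
    """Map time to first audio onto the caller-experience bands from the table."""
    if first_audio_ms < 0:
        raise ValueError("latency cannot be negative")
    if first_audio_ms < 100:
        return "Imperceptible, response feels instant"
    if first_audio_ms < 200:
        return "Natural, barely noticeable"
    if first_audio_ms < 400:
        return "Slight pause, still acceptable"
    if first_audio_ms <= 700:
        return "Noticeable pause, slightly unnatural"
    return "Conversation feels laggy, trust drops"


def measure_first_audio_ms(chunks: Iterable[bytes]) -> float:
    """Time until the stream yields its first audio chunk, in milliseconds."""
    start = time.perf_counter()
    first = next(iter(chunks), None)  # blocks until the first chunk arrives
    if first is None:
        raise ValueError("stream produced no audio")
    return (time.perf_counter() - start) * 1000.0


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real TTS stream: pretends generation takes about 20 ms."""
    time.sleep(0.02)
    yield b"\x00" * 320
    yield b"\x00" * 320


latency = measure_first_audio_ms(fake_stream())
print(f"{latency:.0f} ms: {caller_experience(latency)}")
```

Replace `fake_stream()` with your provider's streaming call and run it many times from the region where your telephony runs, since a single measurement says little about tail latency.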
Voice Customisation: Consistency Across Every Call
For business deployments, voice consistency matters. The AI agent should sound the same on every call, with the same name, the same persona, the same vocal characteristics.
Modern TTS platforms support this in several ways.
Pre-built personas. Every major TTS provider offers a library of voices with different gender presentations, accents, ages, and speaking styles. Selecting a consistent voice and using it across all deployments ensures callers have a coherent experience.
Voice cloning. Some platforms (ElevenLabs, PlayHT) allow you to create a custom voice by training on a sample of recorded speech. This can be used to create a unique brand voice, or to match an existing voice actor used in other customer communications.
Speaking style parameters. Most engines expose parameters for stability (how consistent the voice is across utterances), similarity to the reference voice, and speaking rate. Tuning these for your use case can meaningfully improve consistency and naturalness.
Emotional range. Some engines support emotion tags that shift the delivery: more empathetic for a complaint resolution conversation, more energetic for a sales qualification call, more measured for healthcare interactions. This is still an emerging capability but already useful in well-designed deployments.
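A simple way to enforce consistency is to define the voice once and reuse that definition on every request. The Python sketch below is a provider-neutral illustration. The field names, default values, and valid ranges are assumptions, so map them onto whatever parameters your chosen engine actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """One immutable persona definition shared by every call."""
    voice_id: str
    stability: float = 0.5       # assumed range 0.0 to 1.0
    similarity: float = 0.75     # assumed range 0.0 to 1.0
    speaking_rate: float = 1.0   # assumed range 0.5 to 2.0, 1.0 is normal pace

    def __post_init__(self) -> None:
        for name in ("stability", "similarity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
        if not 0.5 <= self.speaking_rate <= 2.0:
            raise ValueError("speaking_rate must be between 0.5 and 2.0")

    def request_settings(self) -> dict:
        """Settings to attach to each synthesis request."""
        return {
            "voice_id": self.voice_id,
            "stability": self.stability,
            "similarity": self.similarity,
            "speaking_rate": self.speaking_rate,
        }


# Hypothetical persona; "agent-ava" is a placeholder voice identifier.
AGENT_VOICE = VoiceProfile(voice_id="agent-ava", stability=0.6)
print(AGENT_VOICE.request_settings())
```

Making the profile frozen means no code path can quietly change the voice mid-deployment. Any change has to be a deliberate new profile.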
TTS and the Overall Voice AI Stack
TTS is the last component in the chain, but its quality affects the perception of every component before it.
A perfectly transcribed call, correctly interpreted by the LLM, with flawless integration execution, will still produce a poor caller experience if the TTS voice sounds robotic. The caller's subjective experience of the entire interaction is filtered through the voice they hear.
This is why TTS quality is disproportionately important relative to how much attention it typically receives. Callers don't experience the STT accuracy rate or the LLM reasoning quality directly. They experience the voice. The voice is the interface.
How TTS connects to the rest of the stack:
LLM → TTS. The length and structure of LLM responses affect TTS naturalness. Short, conversational responses sound more natural than long, document-style responses, even with the same TTS engine. Designing your system prompt to generate responses appropriate for speech, not for reading, is part of TTS optimisation.
TTS latency → total response time. Like STT, TTS latency is additive. A 200ms TTS latency adds 200ms to every response. For high-frequency short exchanges, this compounds into a conversation that feels sluggish even if the LLM reasoning was fast.
TTS → CSAT. As noted above, voice quality is a primary driver of post-call satisfaction scores. Investing in TTS quality is one of the highest-ROI improvements available for deployed voice AI agents.
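The additive nature of TTS latency is easy to see with a small calculation. The Python sketch below sums per-turn latency and shows how the TTS share accumulates over a call. The STT and LLM figures and the number of exchanges are invented for illustration, and the function names are hypothetical.

```python
def response_time_ms(stt_ms: int, llm_ms: int, tts_ms: int) -> int:
    """Total delay before the caller hears a reply, with stages run in sequence."""
    return stt_ms + llm_ms + tts_ms


def tts_wait_over_call_ms(tts_ms: int, exchanges: int) -> int:
    """Cumulative time a caller spends waiting on TTS alone during one call."""
    return tts_ms * exchanges


# Hypothetical per-turn figures for the non-TTS stages.
STT_MS, LLM_MS = 150, 400

for tts_ms in (200, 90):
    per_turn = response_time_ms(STT_MS, LLM_MS, tts_ms)
    over_call = tts_wait_over_call_ms(tts_ms, exchanges=20)
    print(f"TTS {tts_ms} ms: {per_turn} ms per turn, "
          f"{over_call} ms of TTS wait over 20 exchanges")
```

Under these assumptions, a 200 ms engine accounts for four seconds of waiting across a 20-exchange call, against under two seconds for a 90 ms engine.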
What to Look for When Evaluating TTS for Voice AI
| Requirement | What to Check |
|---|---|
| Streaming support | Does it stream audio as it generates, not after? |
| Time to first audio | How fast does the first word play? Target under 200ms |
| Naturalness on your content | Test with actual scripts from your use case |
| Number and date handling | Test with account numbers, dates, currencies |
| Domain vocabulary | Medical, legal, financial terms pronounced correctly? |
| Language and accent | Does it support the languages your callers speak? |
| Voice consistency | Same voice, same characteristics across sessions? |
| Emotional range | Does it handle empathy, urgency, and neutrality naturally? |
| Cost at scale | Per-character pricing across expected call volume |
The single most important test: listen to your actual content. Generate audio for the responses your agent gives most frequently in your deployment. Listen to them on a phone speaker, not studio headphones. That's what your callers will hear.
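For the cost row in the checklist, a back-of-envelope estimate is usually enough to compare providers. The Python sketch below assumes per-character pricing quoted per million characters. The call volume, characters per call, and price are placeholder values, so substitute your own traffic data and each provider's current price list.

```python
def monthly_tts_cost(calls_per_month: int,
                     agent_chars_per_call: int,
                     price_per_million_chars: float) -> float:
    """Estimated monthly spend, counting only text the agent speaks."""
    total_chars = calls_per_month * agent_chars_per_call
    return total_chars / 1_000_000 * price_per_million_chars


# Placeholder inputs: 50,000 calls, 1,200 spoken characters per call,
# and a hypothetical price of 15.00 per million characters.
cost = monthly_tts_cost(calls_per_month=50_000,
                        agent_chars_per_call=1_200,
                        price_per_million_chars=15.0)
print(f"Estimated monthly TTS cost: {cost:,.2f}")
```

Measure characters per call from real transcripts rather than guessing, because agent verbosity varies widely between workflows.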
Final Thought
Voice quality is not a nice-to-have. It's the primary interface through which callers experience your entire voice AI deployment.
The LLM can be flawless. The integrations can be perfect. The orchestration can be bulletproof. If the voice sounds robotic, callers will feel like they're talking to a machine from 2010, and they'll behave accordingly.
Invest in TTS quality the same way you invest in LLM quality. Test on real content. Listen on real hardware. Measure abandonment rates and transfer requests before and after changing TTS providers.
The voice is the product. Treat it that way.
Want to hear the difference between TTS engines on real call scenarios? Schedule a demo with VoiceInfra and we'll play you the same script across multiple providers.