Every team building an AI voice agent hits the same architectural fork early in the project: a cascaded pipeline or a real-time speech-to-speech model. The marketing around both options tends toward absolutes: real-time is faster; pipelines are more controllable. Neither statement is wrong, and neither is the full picture.
The architecture you choose shapes your latency, your cost structure, your debugging experience, your compliance posture, and your flexibility to swap components as the market moves. This post lays out both approaches honestly, including where each one actually breaks down.
Understanding the Two Architectures
Both architectures share the same input and output: audio in, audio out. What sits between them is fundamentally different.
The Cascaded Pipeline (STT-LLM-TTS)
The cascaded pipeline processes voice through three sequential stages. An ASR (Automatic Speech Recognition) model converts the user's audio to text. The text goes to a large language model for reasoning and response generation. The generated text then passes to a TTS (Text-to-Speech) engine, which synthesizes audio and sends it back to the user.
Each stage is a discrete, independently swappable component. You choose your ASR provider, your LLM, and your TTS voice independently based on accuracy, cost, language support, and latency requirements for each layer.
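The three-stage handoff can be sketched as a simple function composition. This is a minimal illustration, not a real integration: the three provider calls are hypothetical stand-ins for whatever ASR, LLM, and TTS SDKs a team actually selects.

```python
# Minimal sketch of a cascaded STT -> LLM -> TTS pipeline. The three
# stage functions are hypothetical placeholders; real deployments would
# call the SDK clients of the chosen ASR, LLM, and TTS vendors.

def transcribe(audio: bytes) -> str:
    """Hypothetical ASR stage: user audio in, transcript out."""
    return "what is my account balance"  # placeholder transcript

def generate_reply(transcript: str, history: list[dict]) -> str:
    """Hypothetical LLM stage: transcript plus history in, reply text out."""
    history.append({"role": "user", "content": transcript})
    reply = f"You asked: {transcript}. Let me check."  # placeholder reply
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Hypothetical TTS stage: reply text in, synthesized audio out."""
    return text.encode("utf-8")  # placeholder for real audio bytes

def handle_turn(audio: bytes, history: list[dict]) -> bytes:
    # Each stage is a discrete, independently swappable component.
    transcript = transcribe(audio)
    reply = generate_reply(transcript, history)
    return synthesize(reply)
```

Because each stage is just a function boundary, swapping a vendor means replacing one function body without touching the other two.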
The Real-Time Speech-to-Speech Model
Speech-to-speech (S2S) models take audio input directly and produce audio output without converting to text in between. OpenAI's Realtime API and Google Gemini 2.5 Flash Live are the most prominent implementations. The model processes audio natively, preserving acoustic information, emotional tone, and prosody that a text-based pipeline discards at the ASR stage.
There is no visible handoff between stages. The model handles the full input-to-output path as a single system.
Latency: Where Real-Time Models Have a Genuine Advantage
The cascaded pipeline's structural weakness is accumulation. Each stage completes before the next begins in a naive implementation. ASR takes 100-200ms. LLM first-token latency adds another 200-400ms. TTS synthesis adds 150-300ms. Stack those sequentially and you are looking at 450-900ms before the user hears anything.
Real-time S2S models eliminate this stacking. Audio input feeds directly into a model that begins generating audio output without the text conversion round trip. Under optimal conditions, S2S models achieve first-audio latency under 300ms.
However, the latency advantage of S2S is less absolute in practice than it appears in benchmarks. Modern streaming pipeline implementations close most of the gap. Streaming ASR begins transcribing before the user finishes speaking. Streaming LLM inference starts generating tokens immediately. Streaming TTS begins synthesizing before the full response is complete.
With full streaming across all three stages, the effective latency of a pipeline is not ASR + LLM + TTS. It approaches the maximum latency of any individual stage. For well-optimized streaming pipelines, end-to-end latency of 400-600ms is achievable in production, which is perceptible as a slight but not disruptive delay.
S2S models at 250-350ms still win on latency. The question is whether that 150-200ms difference is worth the tradeoffs that come with it.
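The arithmetic behind the two latency figures is worth making explicit. Using the midpoint of each stage range quoted above, a naive sequential pipeline stacks the stages, while a fully streaming pipeline approaches the latency of its slowest stage:

```python
# Back-of-envelope latency comparison using the midpoint figures from
# the text. Sequential stages stack; fully streamed stages overlap, so
# the slowest stage dominates.

stages_ms = {"asr": 150, "llm_first_token": 300, "tts": 225}

naive_ms = sum(stages_ms.values())      # sequential: 150 + 300 + 225
streaming_ms = max(stages_ms.values())  # overlapped: slowest stage wins

print(naive_ms)      # 675ms, inside the 450-900ms range above
print(streaming_ms)  # 300ms floor; real pipelines land at 400-600ms
```

In practice, overlap between stages is imperfect, which is why well-optimized production pipelines land at 400-600ms rather than at the theoretical single-stage floor.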
Cost: Where Pipelines Win Decisively
Real-time S2S models are significantly more expensive than cascaded pipelines in production. Speech-to-speech models run approximately 10 times the cost of chained pipelines, largely due to context accumulation where the model re-charges for all previous tokens on each turn.
In a cascaded pipeline, you pay for inference on the current user utterance. The LLM receives the conversation history as context but you control exactly what that context contains and how it is managed. Token costs grow linearly with conversation length but predictably.
In an S2S model, the entire conversation history, including the audio or internal audio representations of prior turns, accumulates in the model's context window. Every turn processes more context than the last, which means every turn costs more than the last. For long conversations in high-volume deployments, this cost structure is difficult to control and difficult to predict.
For organizations running voice agents at scale (hundreds of thousands of calls per month), the cost differential between the two architectures is operationally significant.
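A toy cost model makes the shape of the difference concrete. The numbers below are illustrative assumptions, not vendor pricing: each turn adds a fixed token count, and the S2S model re-bills the entire accumulated context on every turn.

```python
# Toy cost model contrasting the two billing shapes. tokens_per_turn
# and price are illustrative assumptions, not real vendor rates.

def pipeline_cost(turns: int, tokens_per_turn: int, price: float) -> float:
    # Pipeline: you control the context, so cost grows linearly per turn.
    return turns * tokens_per_turn * price

def s2s_cost(turns: int, tokens_per_turn: int, price: float) -> float:
    # S2S: turn n re-processes the context of all n turns so far,
    # so total cost grows quadratically with conversation length.
    return sum(n * tokens_per_turn * price for n in range(1, turns + 1))

print(pipeline_cost(20, 500, 1e-5))  # 0.10 per 20-turn conversation
print(s2s_cost(20, 500, 1e-5))       # 1.05 -- roughly 10x at 20 turns
```

The quadratic term is why the gap widens with conversation length: at 20 turns the toy model lands near the 10x figure quoted above, and longer calls diverge further.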
Control and Debuggability: The Pipeline Advantage
The pipeline architecture is observable at every stage boundary. You can log the ASR transcript, inspect the LLM prompt and completion, and audit the TTS input independently. When something goes wrong, you know exactly where.
In an S2S model, the path from audio input to audio output is opaque. The internal representations are inaccessible. When a response is wrong or unexpected, attributing the error to a specific cause is substantially harder.
This matters beyond developer experience. In regulated industries including financial services, healthcare, and legal applications, audit trails and explainability are compliance requirements. A system where you cannot inspect what the model received and what it concluded is difficult to audit and harder to certify.
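The stage-boundary audit trail described above can be sketched as structured records written at each handoff. Field names here are illustrative, not a standard schema:

```python
# Sketch of a stage-boundary audit trail. Every handoff in the pipeline
# is logged as a structured record, so an error can be attributed to a
# specific stage and the trail can be serialized for compliance review.
import json
import time

def log_stage(records: list, stage: str, payload: str) -> None:
    records.append({
        "ts": time.time(),
        "stage": stage,      # "asr", "llm", or "tts"
        "payload": payload,  # transcript, LLM completion, or TTS input
    })

records: list[dict] = []
log_stage(records, "asr", "what is my account balance")
log_stage(records, "llm", "Your balance is $42.")
log_stage(records, "tts", "Your balance is $42.")

audit_json = json.dumps(records)  # archivable, inspectable record
```

An S2S deployment can log audio in and audio out, but has no equivalent of the intermediate records, which is the crux of the auditability gap.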
Language and Vendor Flexibility: Pipelines by a Wide Margin
Cascaded pipelines allow independent selection of each component per language, market, or use case. You can use one ASR provider that performs best for Pakistani Urdu and a different one optimized for Gulf Arabic. You can swap the LLM for a more cost-efficient model for simple queries and a more capable one for complex reasoning tasks. TTS voices can differ by persona, language, or regional accent.
S2S models available in production today are primarily English-language or limited in multilingual support. Deploying a voice agent across multiple languages and dialects on an S2S architecture means accepting the language coverage boundaries of a single provider.
For businesses with multilingual requirements, the pipeline architecture is not just preferable. It is the only architecture that currently works.
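The per-language component selection described above often reduces to a routing table. The provider names below are placeholders; the point is that each stage is chosen independently per locale, which a single S2S provider cannot match today:

```python
# Sketch of per-locale component routing. Vendor names are hypothetical
# placeholders; each stage is selected independently per language.

COMPONENT_MAP = {
    "ur-PK": {"asr": "asr_vendor_a", "llm": "llm_small", "tts": "tts_urdu_voice"},
    "ar-AE": {"asr": "asr_vendor_b", "llm": "llm_large", "tts": "tts_gulf_voice"},
    "en-US": {"asr": "asr_vendor_a", "llm": "llm_small", "tts": "tts_en_voice"},
}

def select_components(locale: str) -> dict:
    # Fall back to the English configuration for unconfigured locales.
    return COMPONENT_MAP.get(locale, COMPONENT_MAP["en-US"])
```

The same table can also encode cost tiers, routing simple queries to a cheaper LLM and complex ones to a more capable model.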
PSTN and Telephony: Where S2S Faces Real Friction
S2S models face real friction in telephony integration, and narrowband audio erodes much of their advantage. PSTN telephone networks transmit audio at 8kHz, while S2S models are optimized for wideband audio; at PSTN quality levels, their strengths in acoustic feature preservation are largely irrelevant.
Cascaded pipelines work equally well at narrowband telephony quality. ASR models designed for telephone audio produce accurate transcriptions at 8kHz. The pipeline architecture is telephony-native in a way that current S2S models are not.
For organizations deploying voice agents over SIP trunks, to mobile phones, and to landlines, the pipeline architecture is the substantially safer choice.
When to Choose Real-Time Speech-to-Speech
S2S is the better choice under specific, narrow conditions: consumer-facing voice applications where maximum conversational naturalness is the primary metric, WebRTC-based deployments where wideband audio is available end-to-end, and use cases where emotional responsiveness and prosody preservation materially improve outcomes.
Customer service bots handling refund requests do not need sub-300ms latency with preserved emotional prosody. A consumer companion application or an emotionally intelligent healthcare intake flow might.
The Hybrid Approach: Getting the Best of Both
Several teams building production voice agents are landing on hybrid architectures. S2S handles simple, fast-turn conversational exchanges where latency is most noticeable. The cascaded pipeline handles complex reasoning, tool calls, and structured data extraction where control and auditability matter more than 150ms of latency.
This is architecturally more complex to build but more flexible to operate. The routing logic between S2S and pipeline processing can be based on utterance classification, session state, or the type of action the agent needs to take.
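A minimal version of that routing logic might look like the sketch below. The keyword classifier is a deliberately crude placeholder for a real utterance classifier, and the trigger list is a hypothetical example:

```python
# Sketch of hybrid routing: simple conversational turns go to the S2S
# path, while tool calls and structured actions go through the cascaded
# pipeline. The keyword check stands in for a real utterance classifier.

PIPELINE_TRIGGERS = ("refund", "transfer", "cancel", "update my")  # hypothetical

def route(transcript: str, needs_tool_call: bool) -> str:
    if needs_tool_call:
        return "pipeline"  # structured actions need control and audit trails
    if any(trigger in transcript.lower() for trigger in PIPELINE_TRIGGERS):
        return "pipeline"  # likely to escalate into a structured action
    return "s2s"           # fast conversational turns favor low latency

print(route("How are you today?", needs_tool_call=False))  # s2s
print(route("I want a refund", needs_tool_call=False))     # pipeline
```

In production the classifier would typically be a small model or session-state check rather than keyword matching, but the routing shape is the same.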
Decision Framework
| Criteria | Pipeline Wins | S2S Wins |
|---|---|---|
| Latency requirement | < 500ms acceptable | < 300ms required |
| Cost sensitivity | High volume, cost-controlled | Low volume, quality-first |
| Language coverage | Multilingual required | English-primary acceptable |
| Telephony (PSTN/SIP) | Required | Not required |
| Compliance/audit | Required | Not required |
| Emotional naturalness | Secondary | Primary |
| Component flexibility | Required | Not required |
For most enterprise and commercial voice agent deployments in 2025, the cascaded streaming pipeline remains the correct architectural default. Real-time S2S is not a replacement. It is a specialized tool that wins in specific contexts and loses badly in others.