Every team building an AI voice agent hits the same architectural fork early in the project: a cascaded pipeline or a real-time speech-to-speech model. The marketing around both options tends toward absolutes: real-time is faster; pipelines are more controllable. Neither statement is wrong, and neither is the full picture.
The architecture you choose shapes your latency, your cost structure, your debugging experience, your compliance posture, and your flexibility to swap components as the market moves. This post lays out both approaches honestly, including where each one actually breaks down.
Understanding the Two Architectures
Both architectures share the same input and output: audio in, audio out. What sits between them is fundamentally different.
The Cascaded Pipeline (STT-LLM-TTS)
The cascaded pipeline processes voice through three sequential stages. An ASR (Automatic Speech Recognition) model converts the user's audio to text. The text goes to a large language model for reasoning and response generation. The generated text then passes to a TTS (Text-to-Speech) engine, which synthesizes audio and sends it back to the user.
Each stage is a discrete, independently swappable component. You choose your ASR provider, your LLM, and your TTS voice independently based on accuracy, cost, language support, and latency requirements for each layer.
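The three-stage handoff can be sketched as a simple function composition. This is a minimal illustration, not a real integration: the three provider calls are hypothetical stand-ins for whatever ASR, LLM, and TTS SDKs a team actually selects.

```python
# Minimal sketch of a cascaded STT -> LLM -> TTS pipeline. The three
# stage functions are hypothetical placeholders; real deployments would
# call the SDK clients of the chosen ASR, LLM, and TTS vendors.

def transcribe(audio: bytes) -> str:
    """Hypothetical ASR stage: user audio in, transcript out."""
    return "what is my account balance"  # placeholder transcript

def generate_reply(transcript: str, history: list[dict]) -> str:
    """Hypothetical LLM stage: transcript plus history in, reply text out."""
    history.append({"role": "user", "content": transcript})
    reply = f"You asked: {transcript}. Let me check."  # placeholder reply
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Hypothetical TTS stage: reply text in, synthesized audio out."""
    return text.encode("utf-8")  # placeholder for real audio bytes

def handle_turn(audio: bytes, history: list[dict]) -> bytes:
    # Each stage is a discrete, independently swappable component.
    transcript = transcribe(audio)
    reply = generate_reply(transcript, history)
    return synthesize(reply)
```

Because each stage is just a function boundary, swapping a vendor means replacing one function body without touching the other two.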
The Real-Time Speech-to-Speech Model
Speech-to-speech (S2S) models take audio input directly and produce audio output without converting to text in between. OpenAI's Realtime API and Google Gemini 2.5 Flash Live are the most prominent implementations. The model processes audio natively, preserving acoustic information, emotional tone, and prosody that a text-based pipeline discards at the ASR stage.
There is no visible handoff between stages. The model handles the full input-to-output path as a single system.
Latency: Where Real-Time Models Have a Genuine Advantage
The cascaded pipeline's structural weakness is accumulation. Each stage completes before the next begins in a naive implementation. ASR takes 100-200ms. LLM first-token latency adds another 200-400ms. TTS synthesis adds 150-300ms. Stack those sequentially and you are looking at 450-900ms before the user hears anything.
Real-time S2S models eliminate this stacking. Audio input feeds directly into a model that begins generating audio output without the text conversion round trip. Under optimal conditions, S2S models achieve first-audio latency under 300ms.
However, the latency advantage of S2S is less absolute in practice than it appears in benchmarks. Modern streaming pipeline implementations close most of the gap. Streaming ASR begins transcribing before the user finishes speaking. Streaming LLM inference starts generating tokens immediately. Streaming TTS begins synthesizing before the full response is complete.
With full streaming across all three stages, the effective latency of a pipeline is not ASR + LLM + TTS. It approaches the maximum latency of any individual stage. For well-optimized streaming pipelines, end-to-end latency of 400-600ms is achievable in production, which is perceptible as a slight but not disruptive delay.
S2S models at 250-350ms still win on latency. The question is whether that 150-200ms difference is worth the tradeoffs that come with it.
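The arithmetic behind the two latency figures is worth making explicit. Using the midpoint of each stage range quoted above, a naive sequential pipeline stacks the stages, while a fully streaming pipeline approaches the latency of its slowest stage:

```python
# Back-of-envelope latency comparison using the midpoint figures from
# the text. Sequential stages stack; fully streamed stages overlap, so
# the slowest stage dominates.

stages_ms = {"asr": 150, "llm_first_token": 300, "tts": 225}

naive_ms = sum(stages_ms.values())      # sequential: 150 + 300 + 225
streaming_ms = max(stages_ms.values())  # overlapped: slowest stage wins

print(naive_ms)      # 675ms, inside the 450-900ms range above
print(streaming_ms)  # 300ms floor; real pipelines land at 400-600ms
```

In practice, overlap between stages is imperfect, which is why well-optimized production pipelines land at 400-600ms rather than at the theoretical single-stage floor.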
Cost: Where Pipelines Win Decisively
Real-time S2S models are significantly more expensive than cascaded pipelines in production. Speech-to-speech models run approximately 10 times the cost of chained pipelines, largely due to context accumulation where the model re-charges for all previous tokens on each turn.
In a cascaded pipeline, you pay for inference on the current user utterance. The LLM receives the conversation history as context but you control exactly what that context contains and how it is managed. Token costs grow linearly with conversation length but predictably.
In an S2S model, the entire conversation history, including the audio or internal audio representations of prior turns, accumulates in the model's context window. Every turn processes more context than the last, which means every turn costs more than the last. For long conversations in high-volume deployments, this cost structure is difficult to control and difficult to predict.
For organizations running voice agents at scale (hundreds of thousands of calls per month), the cost differential between the two architectures is operationally significant.
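A toy cost model makes the shape of the difference concrete. The numbers below are illustrative assumptions, not vendor pricing: each turn adds a fixed token count, and the S2S model re-bills the entire accumulated context on every turn.

```python
# Toy cost model contrasting the two billing shapes. tokens_per_turn
# and price are illustrative assumptions, not real vendor rates.

def pipeline_cost(turns: int, tokens_per_turn: int, price: float) -> float:
    # Pipeline: you control the context, so cost grows linearly per turn.
    return turns * tokens_per_turn * price

def s2s_cost(turns: int, tokens_per_turn: int, price: float) -> float:
    # S2S: turn n re-processes the context of all n turns so far,
    # so total cost grows quadratically with conversation length.
    return sum(n * tokens_per_turn * price for n in range(1, turns + 1))

print(pipeline_cost(20, 500, 1e-5))  # 0.10 per 20-turn conversation
print(s2s_cost(20, 500, 1e-5))       # 1.05 -- roughly 10x at 20 turns
```

The quadratic term is why the gap widens with conversation length: at 20 turns the toy model lands near the 10x figure quoted above, and longer calls diverge further.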
Control and Debuggability: The Pipeline Advantage
The pipeline architecture is observable at every stage boundary. You can log the ASR transcript, inspect the LLM prompt and completion, and audit the TTS input independently. When something goes wrong, you know exactly where.
In an S2S model, the path from audio input to audio output is opaque. The internal representations are inaccessible. When a response is wrong or unexpected, attributing the error to a specific cause is substantially harder.
This matters beyond developer experience. In regulated industries including financial services, healthcare, and legal applications, audit trails and explainability are compliance requirements. A system where you cannot inspect what the model received and what it concluded is difficult to audit and harder to certify.
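The stage-boundary audit trail described above can be sketched as structured records written at each handoff. Field names here are illustrative, not a standard schema:

```python
# Sketch of a stage-boundary audit trail. Every handoff in the pipeline
# is logged as a structured record, so an error can be attributed to a
# specific stage and the trail can be serialized for compliance review.
import json
import time

def log_stage(records: list, stage: str, payload: str) -> None:
    records.append({
        "ts": time.time(),
        "stage": stage,      # "asr", "llm", or "tts"
        "payload": payload,  # transcript, LLM completion, or TTS input
    })

records: list[dict] = []
log_stage(records, "asr", "what is my account balance")
log_stage(records, "llm", "Your balance is $42.")
log_stage(records, "tts", "Your balance is $42.")

audit_json = json.dumps(records)  # archivable, inspectable record
```

An S2S deployment can log audio in and audio out, but has no equivalent of the intermediate records, which is the crux of the auditability gap.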
Language and Vendor Flexibility: Pipelines by a Wide Margin
Cascaded pipelines allow independent selection of each component per language, market, or use case. You can use one ASR provider that performs best for Pakistani Urdu and a different one optimized for Gulf Arabic. You can swap the LLM for a more cost-efficient model for simple queries and a more capable one for complex reasoning tasks. TTS voices can differ by persona, language, or regional accent.
S2S models available in production today are primarily English-language or limited in multilingual support. Deploying a voice agent across multiple languages and dialects on an S2S architecture means accepting the language coverage boundaries of a single provider.
For businesses with multilingual requirements, the pipeline architecture is not just preferable. It is the only architecture that currently works.
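The per-language component selection described above often reduces to a routing table. The provider names below are placeholders; the point is that each stage is chosen independently per locale, which a single S2S provider cannot match today:

```python
# Sketch of per-locale component routing. Vendor names are hypothetical
# placeholders; each stage is selected independently per language.

COMPONENT_MAP = {
    "ur-PK": {"asr": "asr_vendor_a", "llm": "llm_small", "tts": "tts_urdu_voice"},
    "ar-AE": {"asr": "asr_vendor_b", "llm": "llm_large", "tts": "tts_gulf_voice"},
    "en-US": {"asr": "asr_vendor_a", "llm": "llm_small", "tts": "tts_en_voice"},
}

def select_components(locale: str) -> dict:
    # Fall back to the English configuration for unconfigured locales.
    return COMPONENT_MAP.get(locale, COMPONENT_MAP["en-US"])
```

The same table can also encode cost tiers, routing simple queries to a cheaper LLM and complex ones to a more capable model.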
PSTN and Telephony: Where S2S Faces Real Friction
S2S models face real friction in telephony integration, and narrowband audio erodes much of their advantage. PSTN telephone networks transmit audio at 8kHz, while S2S models are optimized for wideband audio; at PSTN quality levels, their strengths in acoustic feature preservation are largely irrelevant.
Cascaded pipelines work equally well at narrowband telephony quality. ASR models designed for telephone audio produce accurate transcriptions at 8kHz. The pipeline architecture is telephony-native in a way that current S2S models are not.
For organizations deploying voice agents over SIP trunks, to mobile phones, and to landlines, the pipeline architecture is the substantially safer choice.
When to Choose Real-Time Speech-to-Speech
S2S is the better choice under specific, narrow conditions: consumer-facing voice applications where maximum conversational naturalness is the primary metric, WebRTC-based deployments where wideband audio is available end-to-end, and use cases where emotional responsiveness and prosody preservation materially improve outcomes.
Customer service bots handling refund requests do not need sub-300ms latency with preserved emotional prosody. A consumer companion application or an emotionally intelligent healthcare intake flow might.
The Hybrid Approach: Getting the Best of Both
Several teams building production voice agents are landing on hybrid architectures. S2S handles simple, fast-turn conversational exchanges where latency is most noticeable. The cascaded pipeline handles complex reasoning, tool calls, and structured data extraction where control and auditability matter more than 150ms of latency.
This is architecturally more complex to build but more flexible to operate. The routing logic between S2S and pipeline processing can be based on utterance classification, session state, or the type of action the agent needs to take.
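A minimal version of that routing logic might look like the sketch below. The keyword classifier is a deliberately crude placeholder for a real utterance classifier, and the trigger list is a hypothetical example:

```python
# Sketch of hybrid routing: simple conversational turns go to the S2S
# path, while tool calls and structured actions go through the cascaded
# pipeline. The keyword check stands in for a real utterance classifier.

PIPELINE_TRIGGERS = ("refund", "transfer", "cancel", "update my")  # hypothetical

def route(transcript: str, needs_tool_call: bool) -> str:
    if needs_tool_call:
        return "pipeline"  # structured actions need control and audit trails
    if any(trigger in transcript.lower() for trigger in PIPELINE_TRIGGERS):
        return "pipeline"  # likely to escalate into a structured action
    return "s2s"           # fast conversational turns favor low latency

print(route("How are you today?", needs_tool_call=False))  # s2s
print(route("I want a refund", needs_tool_call=False))     # pipeline
```

In production the classifier would typically be a small model or session-state check rather than keyword matching, but the routing shape is the same.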
Decision Framework
| Criteria | Pipeline Wins | S2S Wins |
|---|---|---|
| Latency requirement | < 500ms acceptable | < 300ms required |
| Cost sensitivity | High volume, cost-controlled | Low volume, quality-first |
| Language coverage | Multilingual required | English-primary acceptable |
| Telephony (PSTN/SIP) | Required | Not required |
| Compliance/audit | Required | Not required |
| Emotional naturalness | Secondary | Primary |
| Component flexibility | Required | Not required |
For most enterprise and commercial voice agent deployments in 2025, the cascaded streaming pipeline remains the correct architectural default. Real-time S2S is not a replacement. It is a specialized tool that wins in specific contexts and loses badly in others.