AI voice agents stopped being a research project somewhere around late 2024. By 2026, they're in production, handling customer support calls, running sales qualification, powering healthcare intake, and replacing IVR trees that nobody liked anyway. The architecture that makes them work at low latency has largely converged around the same stack: a WebRTC layer for real-time audio transport, an LLM for reasoning and response generation, and a text-to-speech engine to close the loop.
LiveKit and OpenAI together cover most of that stack cleanly. LiveKit handles the WebRTC media infrastructure, room management, audio routing, and participant handling. OpenAI brings the speech-to-text via Whisper, the language model via GPT-4o, and text-to-speech to push audio back to the user. The integration between them, when done correctly, gets you a working voice agent with latency that doesn't feel broken.
This guide walks through how to build one: not a toy demo, but a working architecture you can extend into production.
What You're Building
By the end of this guide, you'll have a voice agent that:
Accepts an inbound audio connection from a user via a LiveKit room
Streams audio to OpenAI's Whisper model for real-time transcription
Passes the transcribed text to GPT-4o for response generation
Converts the response to speech and streams it back through LiveKit
Handles turn-taking so the agent doesn't talk over the user
The architecture is server-side: the agent runs as a LiveKit participant, a process that connects to the room, subscribes to audio tracks, and publishes its own audio in response.
Prerequisites
Before starting, you'll need:
Python 3.9 or later
A LiveKit server running (self-hosted or LiveKit Cloud)
An OpenAI API key with access to Whisper and GPT-4o
Basic familiarity with async Python
If you're running LiveKit locally for development, the fastest path is Docker:
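Something like this (the --dev flag is what enables the fixed development credentials; the port mappings follow LiveKit's documented defaults, so adjust them if your configuration differs):

```bash
docker run --rm \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server --dev
```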
This gives you a local LiveKit instance with a fixed API key and secret (devkey / secret) that you can use during development.
Step 1: Install Dependencies
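Install the agents framework and the two plugins used in this guide. The package names are the ones described below; consider pinning versions, since the agents API has changed between releases:

```bash
pip install livekit-agents livekit-plugins-openai livekit-plugins-silero
```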
A quick note on what each package does:
livekit-agents is the framework for building server-side agents that connect to LiveKit rooms as participants. It handles the WebRTC connection, audio track subscription, and the event loop your agent runs inside.
livekit-plugins-openai wraps OpenAI's Whisper (STT) and TTS APIs into LiveKit-compatible interfaces. It handles chunked audio streaming so you're not waiting for a full utterance before processing starts.
livekit-plugins-silero provides the Silero voice activity detection (VAD) model, a lightweight neural model that detects when someone is actually speaking versus silence. This is what makes turn-taking work without the agent constantly processing dead air.
Step 2: Set Up Your Environment
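A minimal .env for local development, assuming the environment variable names the LiveKit Agents CLI and the OpenAI plugin read by default (LIVEKIT_URL, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, OPENAI_API_KEY):

```bash
# .env: local development values matching the Docker dev instance above
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret
OPENAI_API_KEY=your-openai-api-key
```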
For production, replace the LiveKit values with your actual server URL and credentials. If you're using LiveKit Cloud, the URL will be in your project dashboard.
Step 3: Write the Agent
This is the core agent. When it starts, it connects to any LiveKit room it's dispatched to, subscribes to the audio track from participants, and runs the STT → LLM → TTS pipeline in response to speech.
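A sketch of agent.py built on the VoiceAssistant pipeline from livekit-agents with the OpenAI and Silero plugins. Import paths and constructor arguments have shifted between framework releases, so treat this as a starting point against the version you install rather than a drop-in file:

```python
from livekit.agents import JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero


async def entrypoint(ctx: JobContext):
    # Join the room this job was dispatched to
    await ctx.connect()

    # System prompt that defines the agent's behavior
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a helpful voice assistant. Keep responses short, "
            "conversational, and free of markdown formatting."
        ),
    )

    # STT -> LLM -> TTS pipeline, with Silero VAD handling turn detection
    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=openai.STT(),                # Whisper transcription
        llm=openai.LLM(model="gpt-4o"),  # response generation
        tts=openai.TTS(),                # speech synthesis
        chat_ctx=initial_ctx,
    )

    assistant.start(ctx.room)
    await assistant.say("Hi there! How can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```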
Step 4: Run the Agent
```bash
python agent.py dev
```
The dev subcommand runs the agent in development mode: it connects to your local LiveKit server, registers as a worker, and waits for jobs. When a user connects to a room, the agent automatically joins as a participant.
To test it, you can use LiveKit's web playground at https://meet.livekit.io pointed at your local server, or build a minimal frontend with the LiveKit JS SDK.
Step 5: Connect a Frontend
For a minimal browser client that connects a user to a room and lets them talk to the agent:
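A sketch using the livekit-client JavaScript SDK; the server URL and the token plumbing are placeholders you'll wire up to your own backend:

```javascript
import { Room, RoomEvent } from 'livekit-client';

const room = new Room();

// Play the agent's audio as soon as its track arrives
room.on(RoomEvent.TrackSubscribed, (track) => {
  if (track.kind === 'audio') {
    document.body.appendChild(track.attach());
  }
});

// userToken comes from your backend (see below)
await room.connect('wss://your-livekit-server', userToken);

// Publish the user's microphone so the agent can hear them
await room.localParticipant.setMicrophoneEnabled(true);
```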
The userToken is a JWT you generate server-side using your LiveKit API key and secret, scoped to the specific room and participant identity. Never generate tokens in the browser; always generate them on the server.
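A server-side sketch using the livekit-api Python package; the function name is illustrative, and the AccessToken builder reads LIVEKIT_API_KEY and LIVEKIT_API_SECRET from the environment if you don't pass them explicitly:

```python
from livekit import api


def create_user_token(room_name: str, identity: str) -> str:
    # Scope the token to one room and one participant identity
    return (
        api.AccessToken()
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )
```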
Step 6: Customize the Agent Behavior
The system prompt in Step 3 is where most of the behavioral customization happens. A few patterns worth knowing:
For a customer support agent:
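An illustrative system prompt (the company and policy details are placeholders); swap it into the ChatContext from Step 3:

```python
SUPPORT_PROMPT = (
    "You are a customer support agent for Acme Internet (placeholder company). "
    "Keep replies to one or two sentences, since they are spoken aloud. "
    "Ask one clarifying question at a time. "
    "If the caller asks for something outside your scope, such as refunds or "
    "account deletion, offer to transfer them to a human instead of guessing."
)

initial_ctx = llm.ChatContext().append(role="system", text=SUPPORT_PROMPT)
```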
For handling interruptions:
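A sketch of both modes, inside the same entrypoint as Step 3 (the second pattern suits the guided flows mentioned below):

```python
# Interruptible: the agent yields as soon as the caller starts talking
await assistant.say("Let me walk you through the setup.", allow_interruptions=True)

# Non-interruptible: useful for required disclosures or guided, IVR-style steps
await assistant.say(
    "Please note this call may be recorded for quality purposes.",
    allow_interruptions=False,
)
```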
The allow_interruptions=True flag on assistant.say() tells the agent to stop speaking if the user starts talking. This is on by default in VoiceAssistant, but it's worth knowing it's controllable: some use cases (guided flows, IVR-style agents) benefit from disabling it.
Step 7: Handle Edge Cases Before Production
A few things that catch teams off-guard when moving from demo to production:
Silence detection tuning: Silero's VAD has sensitivity settings. The defaults work well for most environments, but if your users are on noisy connections or mobile networks, you'll want to adjust the threshold. Aggressive silence detection causes the agent to cut off users mid-sentence. Loose detection causes unnecessary processing of dead air.
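A sketch of the kind of tuning involved. The parameter names follow the silero.VAD.load() signature in livekit-plugins-silero at the time of writing and may differ in the version you install:

```python
from livekit.plugins import silero

# Wait a little longer before treating a pause as end-of-turn, and require a
# higher speech probability before triggering, for noisy mobile connections.
vad = silero.VAD.load(
    min_silence_duration=0.8,   # seconds of silence before the turn is considered over
    activation_threshold=0.6,   # 0-1; higher means less sensitive to background noise
)
```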
Latency budgeting: The pipeline has three latency components: STT (Whisper processing), LLM (GPT-4o generation), and TTS (audio synthesis). Each adds time. In practice, end-to-end latency on a well-configured deployment runs between 800ms and 1.5 seconds from the user finishing a sentence to the agent starting its response. If you need tighter than that, streaming TTS, where the agent starts speaking as tokens arrive rather than waiting for the full response, helps significantly.
Conversation context management: GPT-4o has a context window. Long conversations accumulate messages. Build a context management strategy before you need one: a simple rolling window that keeps the last N turns plus the system prompt handles most cases.
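A minimal sketch of that rolling window, independent of any particular framework API:

```python
def trim_history(messages: list[dict], max_turns: int = 20) -> list[dict]:
    """Keep the system prompt plus the most recent max_turns messages."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]
    return system + recent
```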
TURN server configuration: Users behind corporate firewalls or on restricted mobile networks often can't establish direct WebRTC connections. Without properly configured TURN servers, those calls drop at connection or experience significant degradation. This is an infrastructure problem, not an application problem, but it's your problem once you're in production.
What This Architecture Scales Into
The agent you've built here is a single participant in a single room. The LiveKit Agents framework is built to scale horizontally: you run multiple worker processes, and LiveKit dispatches jobs to available workers. At scale, you're running a pool of agent workers behind your LiveKit server, each capable of handling one concurrent conversation.
For high-volume deployments (contact centers, inbound voice automation, anything handling hundreds of concurrent calls), the infrastructure layer underneath this matters considerably. The agent code stays largely the same. What changes is the media server configuration, the TURN infrastructure, the monitoring setup, and the operational model for keeping it running reliably.
This is the gap between a working prototype and a production voice agent platform. The code gets you to the prototype. Getting to production requires the infrastructure behind it to be handled correctly.
Where RTC LEAGUE Fits Into This
RTC LEAGUE manages the LiveKit infrastructure layer for teams building exactly this kind of architecture. The agent code is yours; the media server configuration, TURN setup, scaling, monitoring, and operational management are ours.
If you're building a voice agent product and want the WebRTC infrastructure handled by people who specialize in running it, rather than figuring it out alongside everything else you're shipping, that's the gap RTC LEAGUE covers.