Latency Is Not One Number
When someone says "sub-900ms latency," they are usually quoting end-to-end perceived delay, the gap between a speaker's mouth and a listener's ear. That number is actually the sum of several independent delays: network propagation, packet processing at each hop, jitter buffer compensation, codec encode/decode time, and any OS-level scheduling overhead on the media thread.
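As a back-of-envelope illustration (every figure below is a placeholder, not a measurement), the budget decomposes into terms you can only attack one at a time:

```go
package main

import "fmt"

// Illustrative decomposition of a perceived-latency budget. Every
// figure is a placeholder, not a measurement; the point is that
// "latency" is a sum you can only reduce term by term.
func main() {
	components := []struct {
		name string
		ms   int
	}{
		{"network propagation (both legs)", 160},
		{"per-hop packet processing", 20},
		{"jitter buffer", 40},
		{"codec encode/decode", 30},
		{"OS scheduling on media thread", 10},
	}
	total := 0
	for _, c := range components {
		fmt.Printf("%-34s %4d ms\n", c.name, c.ms)
		total += c.ms
	}
	fmt.Printf("%-34s %4d ms\n", "total perceived delay", total)
}
```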
Cloud infrastructure controls some of these and not others. You cannot change the speed of light. You can, however, reduce geographic distance to media relays, cut processing hops, optimize instance placement, and design routing that avoids congested peering points. That is what low-latency cloud infrastructure design actually means in practice.
This post covers the architectural patterns that consistently reduce latency in production real-time communication systems, and the tradeoffs each one carries.
The Geographic Routing Problem
The single largest source of avoidable latency in cloud RTC deployments is routing media traffic through a centralized region. A user in Singapore reaching a media server in us-east-1 sees roughly 100–120ms of one-way propagation delay, well over 200ms round trip, before any processing happens. For an AI voice pipeline that carries audio up and a response back across that path, distance alone consumes a quarter of a 900ms budget.
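Such numbers are easy to sanity-check against the physics floor: light in fiber covers roughly 200 km per millisecond, so great-circle distance gives a hard lower bound on propagation delay (real routes add distance and hops). A minimal sketch:

```go
package main

import "fmt"

// Fiber propagation is roughly 2/3 the vacuum speed of light, about
// 200 km per millisecond one way. Real routes add distance and hops,
// so treat this as a lower bound, never an estimate of actual RTT.
func minOneWayMs(routeKm float64) float64 {
	const fiberKmPerMs = 200.0
	return routeKm / fiberKmPerMs
}

func main() {
	// Great-circle Singapore -> us-east-1 is roughly 15,500 km.
	km := 15500.0
	fmt.Printf("physics floor: %.0f ms one way, %.0f ms RTT\n",
		minOneWayMs(km), 2*minOneWayMs(km))
}
```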
The solution is a Points of Presence (PoP) architecture: a distributed set of edge media nodes, each handling sessions for users in its geographic vicinity. Callers connect to the nearest PoP, which relays media (and in some cases handles processing directly), while signaling may travel to a centralized control plane.
The key design decision is how much work to do at the edge versus centrally. Options include:
Pure relay PoPs: The edge node is a TURN/STUN relay only. All media processing, recording, and AI pipeline work happens in a central region. Good for simplicity; bad for AI voice latency because ASR/TTS round-trip times still cross continents.
Regional processing PoPs: Each PoP runs SFU capacity and media processing. AI inference may still be centralized if GPU capacity is expensive to distribute. This is the most common pattern for mid-scale deployments.
Full edge processing: ASR and TTS inference runs at the edge, co-located with the media relay. Lowest latency possible, but highest infrastructure cost and operational complexity. Viable when GPU spot instances are available at the target edge region.
RTC LEAGUE operates 10+ PoPs globally with sub-999ms round-trip capability, which supports regional processing deployments without requiring full edge inference stacks.
Anycast vs. GeoDNS vs. Latency-Based Routing
Getting users to the right PoP is a routing problem, and there are three common approaches.
GeoDNS resolves a hostname to different IPs based on the client's geographic origin, as determined by their DNS resolver's IP. It is fast to implement, widely supported, and handles coarse geographic routing well. The failure mode is when the DNS resolver is not in the same region as the client. Mobile users on carrier resolvers sometimes get routed to the wrong PoP.
Anycast announces the same IP prefix from multiple locations. BGP routing naturally sends traffic to the topologically closest announcement point. This is how Cloudflare and most large CDNs work. Anycast is more accurate than GeoDNS because it operates at the routing layer, not the DNS layer, but it requires BGP infrastructure that most teams cannot self-manage.
Latency-based routing (AWS Route 53 latency records, for example) measures actual latency from the DNS query's origin to each available endpoint and routes accordingly. More accurate than GeoDNS and simpler than anycast. It works well when your PoPs are in the same clouds as your users, which is true for most enterprise deployments.
For WebRTC specifically, ICE candidate gathering means clients will try multiple paths regardless. The PoP routing matters primarily for TURN relay selection and signaling server affinity.
TURN Server Architecture at Scale
TURN (Traversal Using Relays around NAT) is the fallback relay mechanism for WebRTC when direct peer connections fail. In corporate networks and cellular environments, TURN relay usage can exceed 30% of sessions. That means TURN capacity is not optional infrastructure.
TURN is bandwidth-heavy. Every byte of media passes through the relay twice: once inbound, once outbound. A 50 kbps Opus audio stream becomes 100 kbps of TURN bandwidth. At 1,000 concurrent sessions, that is 100 Mbps of TURN traffic, not counting overhead.
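The capacity math is simple enough to encode directly. The overhead factor below is an assumed illustrative value; actual IP/UDP/TURN framing overhead depends on packet size:

```go
package main

import "fmt"

// Back-of-envelope TURN capacity math: every relayed byte crosses the
// relay twice (client->relay, relay->peer), so provisioned bandwidth
// is roughly 2x the media bitrate per session, plus protocol overhead.
func turnBandwidthMbps(sessions int, mediaKbps, overhead float64) float64 {
	return float64(sessions) * mediaKbps * 2 * (1 + overhead) / 1000
}

func main() {
	// 1,000 concurrent 50 kbps Opus sessions; 15% is an assumed,
	// illustrative framing overhead. Measure your own.
	fmt.Printf("TURN in+out: %.0f Mbps\n", turnBandwidthMbps(1000, 50, 0.15))
}
```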
Scaling TURN horizontally requires session-aware load balancing. A naive layer-4 load balancer breaks TURN because allocations are stateful: a client's Allocate request and subsequent Send/Data indications must reach the same TURN server instance. Options include:
IP hash load balancing: route by client source IP, keeping each client's session on one instance. Works until an instance fails.
TURN servers with shared state: some implementations (coturn with Redis-backed state) can share allocation state across instances, enabling genuinely stateless load balancing.
Edge-pinned TURN: assign TURN endpoint addresses per session during signaling, bypassing load balancers entirely. More complex signaling logic, but the cleanest failure isolation; a sketch follows this list.
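A sketch of the edge-pinned approach, assuming coturn running in use-auth-secret mode: during signaling, the control plane picks a regional TURN endpoint (the region table and hostnames here are hypothetical) and mints short-lived credentials per the TURN REST API memo (draft-uberti-behave-turn-rest):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/base64"
	"fmt"
	"time"
)

// Hypothetical PoP table: region -> TURN endpoint. In production this
// would come from service discovery, not a hard-coded map.
var turnByRegion = map[string]string{
	"ap-southeast": "turn:sin1.turn.example.com:3478",
	"us-east":      "turn:iad1.turn.example.com:3478",
}

// ephemeralCredentials follows the TURN REST API memo: username is an
// expiry timestamp plus user ID, password is base64(HMAC-SHA1(secret,
// username)). coturn validates these when run with use-auth-secret.
func ephemeralCredentials(sharedSecret, userID string, ttl time.Duration) (username, credential string) {
	expiry := time.Now().Add(ttl).Unix()
	username = fmt.Sprintf("%d:%s", expiry, userID)
	mac := hmac.New(sha1.New, []byte(sharedSecret))
	mac.Write([]byte(username))
	credential = base64.StdEncoding.EncodeToString(mac.Sum(nil))
	return username, credential
}

func main() {
	// During signaling, pin the session to a regional TURN endpoint
	// and hand the client short-lived credentials for that node only.
	user, pass := ephemeralCredentials("turn-shared-secret", "session-42", 10*time.Minute)
	fmt.Printf("iceServers: [{urls: %q, username: %q, credential: %q}]\n",
		turnByRegion["ap-southeast"], user, pass)
}
```

Because the client talks to exactly one named TURN node, an instance failure affects only the sessions pinned to it, and no allocation state needs to be shared.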
Monitor the TURN allocation failure rate separately from the connection failure rate. Allocation failures mean capacity exhaustion; connection failures mean NAT traversal issues.
SFU vs. MCU: The Capacity and Latency Tradeoff
For multi-party sessions, the choice between Selective Forwarding Unit (SFU) and Multipoint Control Unit (MCU) architectures directly affects both latency and infrastructure cost.
An MCU decodes all incoming streams and re-encodes a composite output. This is compute-intensive (every stream requires a full decode/encode cycle), but it produces a single stream per recipient regardless of participant count. MCU architecture made sense when client devices were compute-constrained.
An SFU forwards encoded media streams without decoding. It does not touch the payload, only the RTP headers. This is far more efficient per stream, scales horizontally easily, and does not add encode/decode latency. The tradeoff is that clients receive multiple streams and must handle layout rendering themselves.
For AI voice agents, MCU vs. SFU matters differently. If your agent is a bot participant in a conference, it receives one or more audio streams and needs to mix or select the dominant speaker. An SFU delivers separate streams per participant; the agent's audio pipeline handles mixing. This is generally preferable because it gives the ASR a clean signal per speaker rather than a mixed composite.
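As a minimal illustration of the per-stream advantage (not any particular framework's API): with separate streams, dominant-speaker selection reduces to comparing per-frame energy across participants before audio reaches the ASR. A production version would smooth over time and add hysteresis:

```go
package main

import (
	"fmt"
	"math"
)

// rms computes root-mean-square energy of a 16-bit PCM frame.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// dominantSpeaker returns the participant whose current frame has the
// highest energy. Real implementations smooth across frames and apply
// hysteresis to avoid flapping between speakers.
func dominantSpeaker(frames map[string][]int16) string {
	best, bestEnergy := "", -1.0
	for id, frame := range frames {
		if e := rms(frame); e > bestEnergy {
			best, bestEnergy = id, e
		}
	}
	return best
}

func main() {
	frames := map[string][]int16{
		"alice": {120, -340, 560, -210},
		"bob":   {4000, -5200, 6100, -4800}, // louder: the active speaker
	}
	fmt.Println("dominant:", dominantSpeaker(frames))
}
```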
Autoscaling Media Servers: What Works and What Breaks
Media servers (Mediasoup, Janus, Jitsi Videobridge, LiveKit, Ant Media) are stateful in a way that HTTP application servers are not. A WebRTC session has ICE state, DTLS keys, and RTP sequence context that live on a specific server instance. You cannot terminate a session on one instance and resume it on another without the client renegotiating.
This makes standard cloud autoscaling (terminate instances when CPU drops) dangerous. Terminating an instance mid-session drops every active call on that instance.
Drain-based scaling is the correct pattern. When an instance is scheduled for termination, mark it as unavailable for new sessions while it continues serving existing ones. New connections route to other instances. Once all sessions on the draining instance end naturally, terminate it. This requires a session-aware health check endpoint that load balancers can query.
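A minimal sketch of the session-aware health check this requires, assuming a hypothetical /drain hook called by the orchestrator and an external supervisor that terminates the instance once its session count reaches zero:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Drain-aware health endpoint: the load balancer polls /healthz, and a
// draining instance reports unhealthy so no new sessions are routed to
// it while existing sessions keep running.
var (
	activeSessions atomic.Int64
	draining       atomic.Bool
)

func healthz(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		// Tell the balancer to stop sending new sessions here.
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintf(w, "ok sessions=%d\n", activeSessions.Load())
}

func drain(w http.ResponseWriter, r *http.Request) {
	draining.Store(true)
	// An external supervisor polls the session count and terminates
	// the instance only once activeSessions reaches zero.
	fmt.Fprintf(w, "draining, %d sessions remaining\n", activeSessions.Load())
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.HandleFunc("/drain", drain)
	http.ListenAndServe(":8080", nil)
}
```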
Session-count autoscaling outperforms CPU-based autoscaling for media servers. CPU load on a media server correlates with codec complexity, not just session count: a server handling 50 Opus audio sessions may show 5% CPU, then jump to 60% when 10 VP8 video sessions are added. Session counts with codec-weighted capacity math give more predictable scaling triggers than CPU thresholds.
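Codec-weighted capacity math can be as simple as a per-session weight table normalized to one audio session. The weights and budget below are illustrative assumptions; measure your own per codec and instance type:

```go
package main

import "fmt"

// Hypothetical per-session weights, normalized so one Opus audio
// session costs 1 unit. Illustrative only: measure real costs.
var sessionWeight = map[string]float64{
	"opus": 1.0,  // baseline: one audio session
	"vp8":  10.0, // video costs far more per session
}

// weightedLoad turns raw session counts into a single capacity figure
// an autoscaler can compare against the instance's measured budget.
func weightedLoad(sessions map[string]int) float64 {
	var load float64
	for codec, n := range sessions {
		load += sessionWeight[codec] * float64(n)
	}
	return load
}

func main() {
	load := weightedLoad(map[string]int{"opus": 50, "vp8": 10})
	const instanceBudget = 200.0 // measured capacity in Opus-equivalents
	fmt.Printf("load %.0f/%.0f (%.0f%%)\n", load, instanceBudget, 100*load/instanceBudget)
}
```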
Pre-warming means provisioning instances before traffic arrives, not in response to it. Media server startup (loading ICE libraries, binding UDP ports, initializing DTLS contexts) can take 10–15 seconds, so if you scale reactively, new sessions queue during scale-out. Pre-warm to 60–70% capacity and scale at 50%.
Jitter Buffer Configuration and Its Effect on Perceived Latency
Jitter buffers compensate for network packet reordering and delay variation. They hold incoming RTP packets briefly, sort them, and deliver them in order. Buffer depth trades directly against latency: a deeper buffer tolerates more jitter but adds constant delay.
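Adaptive buffers typically key off the RFC 3550 interarrival jitter estimate, a smoothed running average of transit-time variation. A minimal sketch in RTP timestamp units (48 kHz clock for Opus):

```go
package main

import "fmt"

// JitterEstimator implements the RFC 3550 interarrival jitter formula:
//   D = (Rj - Ri) - (Sj - Si)
//   J += (|D| - J) / 16
// All times are in RTP timestamp units (48 kHz clock for Opus).
type JitterEstimator struct {
	prevArrival, prevRTPTime uint32
	jitter                   float64
	primed                   bool
}

func (e *JitterEstimator) Update(arrival, rtpTime uint32) float64 {
	if e.primed {
		// Transit-time difference between consecutive packets;
		// int32 conversion handles 32-bit timestamp wraparound.
		d := float64(int32(arrival-e.prevArrival)) - float64(int32(rtpTime-e.prevRTPTime))
		if d < 0 {
			d = -d
		}
		e.jitter += (d - e.jitter) / 16
	}
	e.prevArrival, e.prevRTPTime, e.primed = arrival, rtpTime, true
	return e.jitter
}

func main() {
	var est JitterEstimator
	// Packets sent every 20ms (960 ticks at 48 kHz), arriving unevenly.
	arrivals := []uint32{0, 1100, 1900, 3050}
	for i, a := range arrivals {
		j := est.Update(a, uint32(i)*960)
		fmt.Printf("packet %d: jitter %.1f ticks (%.2f ms)\n", i, j, j/48)
	}
}
```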
Most WebRTC implementations use adaptive jitter buffers that track network conditions per session. For real-time voice, target jitter buffer depth should stay below 40ms in good network conditions. Above 60ms, the call starts feeling sluggish even with good MOS scores.
For AI agent deployments where the agent generates TTS audio on a media server (rather than receiving it from a network peer), jitter buffers are less relevant on the TTS playback side; they matter on the inbound microphone stream. Configure your WebRTC stack to expose jitter buffer depth per session in your monitoring: persistent values above 80ms indicate that the user's network path needs attention, not your infrastructure.
Cloud Region and Instance Selection for RTC Workloads
Not all cloud regions and instance types perform equally for real-time workloads. A few observations from production deployments:
Avoid burstable instances (T-series on AWS, E2 on GCP) for media servers. CPU credits run out during traffic spikes, producing exactly the worst latency at the worst time.
Prefer compute-optimized instances for SFU/MCU workloads (AWS C6i, GCP C2). High single-core clock speed matters more than core count for RTP processing. Network throughput baselines matter too: confirm your instance type's guaranteed baseline, not its burst ceiling.
Dedicated tenancy or bare metal is worth the cost for high-density deployments. Noisy-neighbor effects on shared hypervisors cause microsecond-scale CPU scheduling delays that accumulate into jitter. At 1,000+ concurrent sessions per host, this becomes measurable.
Placement groups (AWS cluster placement groups, GCP compact placement) reduce inter-instance latency for deployments where media servers need to communicate with each other, cascaded SFUs for example.
Observability: Measuring What Actually Matters
Low-latency cloud infrastructure without observability is infrastructure you cannot trust. The metrics that actually matter for RTC workloads:
Per-session RTT: RTCP Sender Reports and Receiver Reports carry timing data that lets you calculate round-trip time per session (a worked sketch follows this list). Aggregate by PoP and region to identify geographic outliers.
End-to-end media delay: separate from network RTT, this includes jitter buffer depth plus encode/decode time plus processing delay. Hard to measure passively; it requires active probing or injecting timing metadata into your media pipeline.
ICE candidate type distribution: the ratio of host/srflx/relay connections tells you TURN relay load and NAT traversal success rate. A spike in relay connections indicates a TURN capacity or configuration issue.
TURN allocation latency: measured from Allocate request to Allocate Success response. Over 100ms means your TURN instances are overloaded or geographically wrong for that client.
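The per-session RTT figure from the first item is mechanical to compute once you have the Receiver Report fields. Per RFC 3550, RTT = A - LSR - DLSR, where A is the report's arrival time and all values are middle-32-bit NTP timestamps in units of 1/65536 seconds. A minimal sketch:

```go
package main

import "fmt"

// rttFromRR computes round-trip time from an RTCP Receiver Report per
// RFC 3550: RTT = A - LSR - DLSR, all in middle-32-bit NTP units.
//   arrival = our clock when the RR arrived
//   lsr     = "last SR" timestamp the peer echoed back
//   dlsr    = delay the peer held our SR before replying
func rttFromRR(arrival, lsr, dlsr uint32) float64 {
	rtt := arrival - lsr - dlsr // wraps correctly in uint32 arithmetic
	return float64(rtt) / 65536.0 * 1000.0 // milliseconds
}

func main() {
	// Illustrative values: our SR echoed as LSR at t=100s, the peer
	// held it 50ms (DLSR), and the RR arrived at t=100.180s, so the
	// RTT should come out to roughly 130ms.
	const tick = 65536 // 1 second in middle-32-bit NTP units
	lsr := uint32(100 * tick)
	dlsr := uint32(50 * tick / 1000)
	arrival := uint32(100180 * tick / 1000)
	fmt.Printf("RTT = %.1f ms\n", rttFromRR(arrival, lsr, dlsr))
}
```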
GCP, AWS, and Azure all provide network performance insights at the VPC level, but these do not capture application-layer jitter or per-session quality. Instrument your media servers with RTCP processing and export metrics to your observability stack.