Voice AI Agents in Production: Vapi vs Retell vs LiveKit vs DIY (2026 Guide)
A voice AI agent that takes more than 700 milliseconds to start responding sounds broken to a human ear. Most stacks miss that target. The ones that hit it win the call. This is the vendor-neutral guide to building production voice agents in 2026 - from the latency budget physics to the Vapi-vs-Retell-vs-LiveKit-vs-DIY decision that defines your next 12 months of operations.
TL;DR - Key Takeaways
- Sub-700ms response latency (from the user finishing speaking to the agent starting to speak) is the threshold between "natural" and "robotic." Above 1200ms, users hang up.
- The latency budget breaks down as: STT 100-200ms + LLM 200-400ms + TTS 100-200ms + network 50-100ms. That sums to 450-900ms, so you cannot sit at the top of every range and still land under 700ms.
- Vapi and Retell are the fastest path to a working agent. LiveKit Agents wins on control. DIY (Twilio + Deepgram + GPT-4o-mini + Cartesia) wins on cost past ~10K minutes/month.
- Interruption handling, voicemail detection, and fallback are the three features that decide whether your agent is usable in production.
- A healthcare client running about 12,000 minutes of outbound scheduling calls per month went from DIY to Vapi, then back to DIY nine months later. Pick deliberately, not by hype.
The Latency Budget That Makes or Breaks Voice
Humans tolerate roughly 200-400ms of silence in conversation before it feels off. Past 700ms, the agent stops sounding human to the person on the other end. Past 1200ms, they hang up. Your end-to-end latency from "user finished speaking" to "agent starts speaking" has to fit inside this window. The components are unforgiving.
STT (speech-to-text) takes 100-200ms with a streaming model like Deepgram or AssemblyAI. The LLM gets 200-400ms to produce its first token, which means GPT-4o-mini or Claude Haiku, not GPT-4o on a cold start. TTS (text-to-speech) takes 100-200ms to first audio with Cartesia, ElevenLabs Turbo, or PlayHT. Network round-trips eat 50-100ms depending on geography. There is no slack. Every architecture decision has to fight for milliseconds.
Vapi vs Retell vs LiveKit Agents vs DIY
| Stack | Time to v1 | Cost at 10K min/mo | Customization | Best For |
|---|---|---|---|---|
| Vapi | 1-3 days | ~$1,500/mo | Medium - prompt + tool config | Founders who need to ship this month |
| Retell | 1-3 days | ~$1,400/mo | Medium - similar to Vapi | Outbound-heavy use cases |
| LiveKit Agents | 1-2 weeks | ~$900/mo (infra) + LLM/STT/TTS | High - full code control | Teams with eng capacity, custom needs |
| DIY (Twilio + Deepgram + GPT-4o-mini + Cartesia) | 3-6 weeks | ~$600/mo | Total | 10K+ minutes/month, custom orchestration |
Cost numbers are rough order-of-magnitude estimates at 10,000 minutes per month with US phone numbers. Vendor pricing changes, so verify before architecting around a number.
The TJ Voice Agent Latency Budget
700ms total budget from "user stops speaking" to "agent starts speaking." Allocate as follows:
STT (streaming): 150ms. Use a streaming model that returns partial transcripts while the caller is still speaking. Deepgram Nova-2, AssemblyAI Universal-Streaming, or the platform's bundled STT.
LLM (first token): 350ms. Stream the response. Use a fast model (GPT-4o-mini, Claude Haiku, Gemini Flash). Heavier models break the budget.
TTS (first audio chunk): 150ms. Use a streaming TTS that emits audio while the LLM is still generating. Cartesia, ElevenLabs Turbo v2, or PlayHT Lightning.
Network + glue: 50ms. Co-locate services. Avoid serverless cold starts on the voice path.
If any single component blows its budget, the whole experience feels off. Optimize the slowest component first.
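To make "optimize the slowest component first" concrete, here is a minimal sketch that compares one turn's measured latencies against the allocations above and ranks the overruns. The stage names and the sample measurements are hypothetical; real numbers would come from your own tracing.

```python
# Per-stage allocations from the budget above, in milliseconds.
BUDGET_MS = {
    "stt": 150,
    "llm_first_token": 350,
    "tts_first_audio": 150,
    "network_glue": 50,
}
TOTAL_BUDGET_MS = sum(BUDGET_MS.values())  # 700


def overruns(measured_ms: dict[str, float]) -> list[tuple[str, float]]:
    """Return (stage, ms over budget) pairs, worst offender first."""
    over = [
        (stage, measured_ms[stage] - limit)
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > limit
    ]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical measurements for a single turn (assumed values, not benchmarks).
turn = {"stt": 140, "llm_first_token": 520, "tts_first_audio": 180, "network_glue": 45}

print(f"total {sum(turn.values())} ms against a {TOTAL_BUDGET_MS} ms budget")
for stage, extra in overruns(turn):
    print(f"{stage}: {extra:.0f} ms over")
```

In this made-up turn the LLM is the worst offender, so swapping to a faster model would come before tuning TTS.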
What Production Voice Looks Like
When to Use a Vendor vs Roll Your Own
Use Vapi or Retell if you are shipping in under a month, your call volume is under 5,000 minutes per month, your use case is standard (inbound support, outbound scheduling), and you do not yet have a strong opinion about prompt orchestration.
Use LiveKit Agents if you have at least one engineer who can own the stack, you need custom logic between turns (database lookups, multi-agent handoff, complex tool routing), and your call volume justifies a few weeks of build time.
Go DIY (Twilio + Deepgram + GPT-4o-mini + Cartesia) if your call volume is above ~10K minutes per month, you need full control over latency optimization, you have non-standard requirements (special compliance, custom telephony provider, unusual languages), or your unit economics will not survive vendor markup.
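The rules of thumb above can be written down as a small decision helper. This is only a sketch: the function name and parameters are invented for illustration, the thresholds are the ones quoted in the text, and the default for the 5K-10K band (which the text does not cover) is an assumption.

```python
def suggest_stack(
    minutes_per_month: int,
    has_owning_engineer: bool,
    needs_custom_turn_logic: bool,
    non_standard_requirements: bool = False,
) -> str:
    """Encode the guidance above; thresholds come from the text, not from any vendor."""
    # Any one of the DIY triggers is enough.
    if minutes_per_month > 10_000 or non_standard_requirements:
        return "DIY"
    # LiveKit Agents needs both an owner and a reason for custom logic.
    if has_owning_engineer and needs_custom_turn_logic:
        return "LiveKit Agents"
    if minutes_per_month < 5_000:
        return "Vapi or Retell"
    # Assumption: between 5K and 10K minutes, stay on a vendor and re-check costs.
    return "Vapi or Retell (re-evaluate as volume grows)"


print(suggest_stack(2_000, has_owning_engineer=False, needs_custom_turn_logic=False))
print(suggest_stack(8_000, has_owning_engineer=True, needs_custom_turn_logic=True))
print(suggest_stack(12_000, has_owning_engineer=True, needs_custom_turn_logic=False))
```

A helper like this is mostly useful as a checklist to revisit each quarter, since the inputs change as volume grows.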
Real-World: 12K Minutes/Month Healthcare Scheduler
A healthcare client running outbound appointment-reminder calls started on DIY (Twilio plus their own orchestration), migrated to Vapi to ship faster after a re-org, then migrated back to DIY nine months later when call volume crossed 12K minutes per month. The Vapi bill at peak was $1,800/month; the DIY equivalent was $620/month plus about 4 hours/month of ops time. The lesson: vendor stacks are the right choice at low volume and during the "prove the use case" phase. Once product-market fit is clear, the markup stops being worth it. The right answer changes over time.
5 Common Voice Agent Mistakes
1. No interruption handling. If the user starts talking while the agent is mid-sentence, the agent must stop, listen, and respond to the new utterance. Without this, the agent talks over the user and they hang up.
2. No voicemail detection. Outbound agents waste minutes (and money) leaving "interactive" messages on voicemail. Add a voicemail classifier that runs in the first 2 seconds of the call, before the agent starts its script.
3. No human fallback. When the agent is stuck, transfer to a human. "Stuck" means three failed turns or the user explicitly asks. This is the single biggest CSAT lever.
4. Picking the wrong LLM. GPT-4o sounds great but on a cold start it blows the latency budget. Use the fastest competent model and reserve the heavy model for offline async tasks.
5. No call transcript review. Every voice agent should ship with transcript review built in. Sample 5-10% of calls weekly. The failure modes are not in your dashboards; they are in the conversations.
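Two of the rules above are easy to pin down in code: the handoff trigger from mistake 3 and the review sample from mistake 5. The sketch below is illustrative only. The phrase list is a crude keyword match standing in for a proper intent classifier, and the function names and the hash-based sampling approach are assumptions, not something the source prescribes.

```python
import hashlib

MAX_FAILED_TURNS = 3  # "three failed turns" from mistake 3
HANDOFF_PHRASES = ("human", "representative", "real person")  # illustrative, not exhaustive


def should_hand_off(failed_turns: int, user_text: str) -> bool:
    """Transfer when the agent is stuck or the caller asks for a person."""
    asked = any(phrase in user_text.lower() for phrase in HANDOFF_PHRASES)
    return failed_turns >= MAX_FAILED_TURNS or asked


def in_review_sample(call_id: str, rate: float = 0.10) -> bool:
    """Deterministically select about `rate` of calls for weekly transcript review."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


print(should_hand_off(1, "Can I talk to a real person?"))
print(should_hand_off(2, "Tuesday works"))
print(sum(in_review_sample(f"call-{i}") for i in range(10_000)))
```

Hashing the call ID rather than drawing a random number means the same call is always in or out of the sample, which keeps reviews reproducible.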
Frequently Asked Questions
What does Vapi cost at scale?
Roughly $0.13-0.18 per minute all-in (their margin plus underlying STT/LLM/TTS). At 10,000 minutes per month that is $1,300-1,800. The DIY equivalent is typically $0.05-0.07 per minute. Crossover happens around 5-10K minutes/month depending on your specific stack.
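The arithmetic behind those numbers, as a short sketch. The per-minute ranges are the ones quoted above. `DIY_FIXED` is a hypothetical monthly overhead for ops and engineering time, added here only to show why a crossover exists at all; it is not a figure from this guide.

```python
VENDOR_PER_MIN = (0.13, 0.18)  # all-in vendor range quoted above, USD per minute
DIY_PER_MIN = (0.05, 0.07)     # DIY range quoted above, USD per minute


def monthly_cost(minutes: int, per_minute: float, fixed: float = 0.0) -> float:
    """Usage cost plus any fixed monthly overhead."""
    return minutes * per_minute + fixed


def breakeven_minutes(vendor_rate: float, diy_rate: float, diy_fixed: float) -> float:
    """Monthly minutes at which DIY, including fixed overhead, matches the vendor bill."""
    return diy_fixed / (vendor_rate - diy_rate)


minutes = 10_000
print("vendor:", [round(monthly_cost(minutes, rate)) for rate in VENDOR_PER_MIN])
print("diy:   ", [round(monthly_cost(minutes, rate)) for rate in DIY_PER_MIN])

# Assumption: a hypothetical fixed DIY overhead, in USD per month.
DIY_FIXED = 700.0
mid_vendor = sum(VENDOR_PER_MIN) / 2
mid_diy = sum(DIY_PER_MIN) / 2
print("breakeven minutes:", round(breakeven_minutes(mid_vendor, mid_diy, DIY_FIXED)))
```

With the midpoint rates and the assumed $700 of overhead, breakeven lands at roughly 7,400 minutes per month, inside the 5-10K band. A larger overhead pushes the crossover higher, which is why it depends on your specific stack.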
Can I use Claude or Gemini for voice?
Yes. Claude Haiku and Gemini Flash both fit the latency budget. Used as text models, they do not take or produce audio directly: you stream text out, then send it to TTS, which is the standard pattern. OpenAI's Realtime API is the best-known option that does audio-to-audio without an intermediate text step.
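The "stream text out, then send to TTS" pattern usually means cutting the LLM's token stream into speakable chunks so TTS can start before the reply is finished. Below is a minimal sketch of that chunking step. The `fake_stream` list stands in for a real streaming LLM client, and the sentence-boundary rule is deliberately naive (it will split early on abbreviations such as "Dr.").

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM text into sentences so TTS can start before the reply is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


# Stand-in for a streaming LLM response; a real client would yield text deltas.
fake_stream = ["Sure", ", I can", " help.", " What day", " works", " for you?"]

for chunk in sentence_chunks(fake_stream):
    # In a real agent, this is where the chunk would be handed to the TTS client.
    print(chunk)
```

The first sentence is ready for TTS after three tokens instead of six, which is where the latency saving comes from.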
What about OpenAI Realtime API?
Strong for fast prototypes. The audio-to-audio model collapses STT + LLM + TTS into one call, which simplifies the stack and reduces latency. Tradeoffs: less control over voice, fewer options for swapping components, and per-minute cost is higher than the unbundled DIY equivalent at scale.
How do I handle compliance (HIPAA, PCI) with voice?
HIPAA: most voice vendors offer a BAA on enterprise tiers. PCI: avoid sending card data through the LLM at all - use DTMF capture or transfer to a PCI-compliant payment IVR for the card-entry step. Recording retention policies need legal review, not engineering improvisation.
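DTMF capture or a payment IVR remains the primary control. As a defense-in-depth measure, some teams also scrub transcripts before they reach the LLM, in case a caller reads a card number aloud anyway. The sketch below assumes the STT output renders spoken digits as numerals; the function names and the redaction token are made up for illustration, and this is not a substitute for a PCI assessment.

```python
import re

# 13-19 digits, optionally separated by single spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, the check digit scheme used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Mask digit runs that pass the Luhn check before the text is sent to the LLM."""
    def replace(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
print(redact_cards("my card is 4111 1111 1111 1111 thanks"))
print(redact_cards("call me back on 555-123-4567"))
```

The Luhn check keeps shorter digit runs such as phone numbers and most random long numbers from being masked, at the cost of occasionally missing a mistranscribed card number.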
How long does it take to build a production voice agent?
With Vapi or Retell: a working v1 within a week, and a polished v1 with handoff and fallback in 2-3 weeks. LiveKit Agents: 3-4 weeks to production. Full DIY: 6-10 weeks for the first production agent, faster for subsequent ones. Add 2-4 weeks for compliance review in regulated industries.
Conclusion
Voice AI is the highest-leverage interface launching in 2026 - it works for users who would never type, it handles 5x more conversations per hour than humans, and it pays back in months. The teams that win are the ones that respect the latency budget, pick the right stack for their phase, and build the boring three features (interruption, voicemail, fallback) before launch.
Scoping a voice agent and not sure which stack to pick? Happy to walk through it in a free 30-minute call.