通过放弃 OpenAI 实现 375ms 语音到语音延迟的经验
📄 中文摘要
在为医疗客户运营 AI 自动化代理的过程中,作者面临着严重的延迟问题。使用标准技术栈(Twilio $o$ Vapi/Retell $o$ GPT-4o $o$ ElevenLabs)构建语音代理时,延迟通常在 800ms 到 1200ms 之间,导致用户体验不佳。每当音频数据需要离开服务器传输到 OpenAI 或 ElevenLabs 时,都会损失约 200ms 的时间。为了降低延迟,作者采取了购买专用硬件的极端措施,转向了 NVIDIA Blackwell 的裸金属解决方案,从而显著提高了语音交互的响应速度。通过这种方式,作者成功将延迟降低至 375ms,改善了用户体验。
📄 English Summary
How I hit 375ms Voice-to-Voice latency by ditching OpenAI for Bare Metal NVIDIA Blackwells
The author, running an AI automation agency for healthcare clients, faced significant latency issues while building voice agents using a standard tech stack (Twilio → Vapi/Retell → GPT-4o → ElevenLabs). The latency typically ranged from 800ms to 1200ms, resulting in a poor user experience where users would interrupt the bot, causing it to continue speaking for a second before realizing. Each time audio left the server to reach OpenAI or ElevenLabs, approximately 200ms was lost. To combat this, the author made the drastic decision to purchase dedicated hardware, transitioning to a Bare Metal NVIDIA Blackwell solution, which successfully reduced latency to 375ms and improved user interaction significantly.
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等