Your nginx Is Killing Your AI Service: Why You Need to Redesign the Traffic Layer

📄 Summary
Four numbers expose the core problem facing AI infrastructure: the longest wait users will tolerate is 3 seconds, beyond which churn rises sharply; the median time for a 70B model to complete a full inference pass on an A100 is 47 seconds; the same model emits its first token in 0.3 seconds; and one A100 costs $2.48 per hour on demand, money wasted outright if the GPU sits idle at 3 AM. The tension among these four numbers is the most fundamental engineering problem in AI infrastructure: users demand instant responses, models need time to compute, and compute must be scheduled precisely, yet the traditional traffic layer knows nothing about any of this.
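The tension between these four numbers can be made concrete with a small sketch. The snippet below is not from the article; it is a minimal illustration, using the article's four figures, of why a traffic layer that only sees total response time fails a long-running inference, while a streaming-aware check keyed to time-to-first-token stays inside the user's 3-second patience budget. All function names here are hypothetical.

```python
# The article's four numbers, used as constants.
USER_PATIENCE_S = 3.0      # users abandon after ~3 s of silence
FULL_INFERENCE_S = 47.0    # median full pass, 70B model on one A100
FIRST_TOKEN_S = 0.3        # time to first token for the same model
A100_HOURLY_USD = 2.48     # on-demand price of one A100 per hour

def naive_proxy_ok(total_response_s: float,
                   timeout_s: float = USER_PATIENCE_S) -> bool:
    """A traditional traffic layer only sees the full response time,
    so any request longer than its timeout looks like a failure."""
    return total_response_s <= timeout_s

def streaming_proxy_ok(first_token_s: float,
                       timeout_s: float = USER_PATIENCE_S) -> bool:
    """A streaming-aware layer judges by time-to-first-token instead:
    the user sees output long before the full 47 s pass finishes."""
    return first_token_s <= timeout_s

def idle_cost_usd(idle_hours: float,
                  hourly_usd: float = A100_HOURLY_USD) -> float:
    """Money burned by a GPU sitting idle, e.g. overnight at 3 AM."""
    return idle_hours * hourly_usd

print(naive_proxy_ok(FULL_INFERENCE_S))    # False: 47 s blows the 3 s budget
print(streaming_proxy_ok(FIRST_TOKEN_S))   # True: first token at 0.3 s
print(round(idle_cost_usd(6), 2))          # 14.88: six idle overnight hours
```

The point of the sketch is the design choice it encodes: the health signal a traffic layer should watch for AI workloads is time-to-first-token, not total latency, and idle-hour cost is what makes precise scheduling of the remaining 47 seconds worth engineering for.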
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others