📄 Chinese Summary (translated)
Cold start latency breaks Service Level Agreements (SLAs) because "the pod is Running" does not mean "the model is ready." In a Kubernetes-plus-vLLM deployment, a cold start includes image pulls, weight downloads, tensor loading into GPU memory, and warm-up work (typically CUDA graph capture). These events are infrequent but costly, and when scaling from zero they often dominate p95/p99 latency. After a new vLLM revision is deployed, the Horizontal Pod Autoscaler (HPA) scales pods up quickly; as traffic shifts, p50 looks fine while p99 spikes sharply. Users get routed to instances that are still loading the model and warming up. This is not a bug but a consequence of physics and scheduling. If you run strict SLAs on a GPU fleet, cold starts become a first-order problem.
📄 English Summary
Cold Starts, Model Loading, and Their Impact on Latency SLAs
Cold start latency breaks Service Level Agreements (SLAs) because "pod is Running" does not mean "model is ready." When running vLLM on Kubernetes, a cold start encompasses image pulls, weight downloads, tensor loading into GPU memory, and warm-up work, typically CUDA graph capture. These events are infrequent but costly, and when scaling from zero they often dominate p95/p99 latency. After deploying a new vLLM revision, the Horizontal Pod Autoscaler (HPA) scales up quickly; as traffic shifts, p50 looks fine while p99 spikes dramatically. Users are routed to instances that are still loading the model and warming up, which is not a bug but a consequence of physics and scheduling. If you run strict SLAs on a GPU fleet, cold starts become a first-order problem.
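One common mitigation for the "Running but not ready" gap is to let Kubernetes itself distinguish the two states: a startup probe tolerates the long load phase without restarting the pod, and a readiness probe keeps Service traffic away until the model server answers its health check. A minimal sketch, assuming a vLLM OpenAI-compatible server listening on port 8000 with a `/health` endpoint; the image tag, port, and probe timings are illustrative, not prescriptive:

```yaml
# Illustrative probe configuration for a vLLM serving pod.
# Assumes the server returns HTTP 200 on /health only after
# weights are loaded and warm-up has completed.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest   # example image reference
    ports:
      - containerPort: 8000
    # startupProbe absorbs the cold start (weight download, tensor
    # load, CUDA graph capture): up to 60 * 10s = 10 minutes
    # before Kubernetes gives up and restarts the container.
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      failureThreshold: 60
    # readinessProbe gates traffic: the pod is only added to
    # Service endpoints once the model reports healthy.
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 5
      failureThreshold: 3
```

With a gate like this, scale-up still takes minutes, but new replicas pay the cold-start cost off the request path instead of surfacing it to users as p99 spikes.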
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others