# Beyond Round Robin: Building a Token-Aware Load Balancer for LLMs

In earlier experiments, the same request was sent to several large language models (LLMs) in parallel and the fastest response was returned. While this produced quick answers, it wasted GPU cycles on every server that lost the race. This post proposes a load balancer that instead selects the appropriate backend upfront, at the moment the request arrives, rather than letting backends compete against each other. The goal is to keep responses fast while reducing resource consumption, using GPU capacity more efficiently.
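The core idea of picking a backend upfront can be sketched as a least-outstanding-tokens policy: estimate the incoming request's token cost, route it to the backend with the smallest token backlog, and charge that backend for the new work. This is a minimal illustrative sketch, not the post's actual implementation; the `Backend` class, the 4-characters-per-token heuristic, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical LLM backend with a rough measure of queued work."""
    name: str
    in_flight_tokens: int = 0  # estimated tokens currently queued on this backend


def estimate_tokens(prompt: str) -> int:
    # Crude heuristic (an assumption): roughly 4 characters per token
    # for English text; a real balancer would use a proper tokenizer.
    return max(1, len(prompt) // 4)


def pick_backend(backends: list[Backend], prompt: str) -> Backend:
    # Choose the backend with the smallest estimated token backlog,
    # then account for the new request's estimated cost up front.
    cost = estimate_tokens(prompt)
    chosen = min(backends, key=lambda b: b.in_flight_tokens)
    chosen.in_flight_tokens += cost
    return chosen
```

In contrast to racing all backends, only the chosen server spends GPU cycles on the request; the balancer's bookkeeping (decrementing `in_flight_tokens` when a response completes) is omitted here for brevity.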

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others