📄 English Summary
CUDA Graphs in LLM Inference: Deep Dive
LLM inference, particularly during the token generation (decoding) phase, is often dominated by CPU overhead rather than GPU compute. Each decoding step generates a single token per sequence, and the actual GPU work (small matrix multiplications and attention over one query) finishes in microseconds. However, the CPU can spend tens of microseconds per kernel launch on bookkeeping, driver calls, and synchronization. With hundreds of kernel launches per transformer forward pass, this CPU overhead can become the bottleneck. Worse, the CPU must simultaneously prepare data for the next batch and update token state, which further exacerbates the problem.
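To make the overhead argument concrete, here is a back-of-envelope model of one decode step. All numbers are illustrative assumptions (per-kernel launch overhead, per-kernel GPU time, kernel count, and graph-replay cost are not taken from any measured system): with eager launches the CPU cost is paid once per kernel, while replaying a captured CUDA Graph relaunches the whole step with a single API call.

```python
# Back-of-envelope latency model for one decode step.
# All constants are assumed, illustrative numbers, not measurements.
LAUNCH_OVERHEAD_US = 10.0   # assumed CPU cost per kernel launch
GPU_WORK_US = 5.0           # assumed GPU time per kernel
KERNELS_PER_STEP = 600      # assumed launches per transformer forward pass

# Eager execution: CPU launch overhead is paid on every kernel.
eager_us = KERNELS_PER_STEP * (LAUNCH_OVERHEAD_US + GPU_WORK_US)

# CUDA Graph replay: one graph launch covers the whole step, so the
# per-kernel CPU overhead collapses to a single (assumed) replay cost.
GRAPH_LAUNCH_US = 20.0
graph_us = GRAPH_LAUNCH_US + KERNELS_PER_STEP * GPU_WORK_US

print(f"eager: {eager_us:.0f} us per decode step")
print(f"graph: {graph_us:.0f} us per decode step")
print(f"speedup: {eager_us / graph_us:.1f}x")
```

Under these assumed numbers the step is roughly 3x faster under graph replay, and the remaining time is almost entirely GPU work rather than CPU bookkeeping; the real ratio depends on the model, batch size, and driver.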
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others