📄 English Summary
Speculating Experts Accelerates Inference for Mixture-of-Experts
Mixture-of-Experts (MoE) models have become increasingly popular for scaling the capacity of large language models (LLMs) while keeping activations sparse and per-token computation low. However, in memory-constrained inference environments, expert weights must be offloaded to the CPU, and the resulting CPU-GPU transfers become a performance bottleneck during decoding. An expert prefetching scheme is proposed that uses internal representations from the current computation to speculatively predict future experts, allowing memory transfers to overlap with computation. Experiments across several MoE architectures demonstrate that these internal representations reliably predict upcoming experts, and that executing the speculated experts generally preserves model quality.
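The core idea, overlapping speculative CPU-to-GPU expert transfers with the current layer's computation, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's method: the predictor `predict_next_experts`, the toy scoring, and the thread-based "transfer" are all hypothetical stand-ins (a real system would use a learned or router-based predictor and asynchronous device copies).

```python
# Minimal sketch of speculative expert prefetching (illustrative only).
# Assumptions: `predict_next_experts` is a hypothetical predictor, and a
# background thread stands in for an asynchronous CPU->GPU weight copy.
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4
EXPERTS_PER_LAYER = 8
TOP_K = 2

# Expert weights "resident on CPU"; the GPU cache starts empty.
cpu_experts = {(l, e): f"weights[{l}][{e}]"
               for l in range(NUM_LAYERS) for e in range(EXPERTS_PER_LAYER)}
gpu_cache = {}

def predict_next_experts(hidden_state, next_layer):
    # Hypothetical predictor: treat a cheap function of the current hidden
    # state as a proxy for the next layer's router logits, take top-k.
    scores = [(hidden_state * 31 + e) % EXPERTS_PER_LAYER
              for e in range(EXPERTS_PER_LAYER)]
    ranked = sorted(range(EXPERTS_PER_LAYER),
                    key=lambda e: scores[e], reverse=True)
    return [(next_layer, e) for e in ranked[:TOP_K]]

def fetch_to_gpu(key):
    # Stands in for an asynchronous host-to-device memory copy.
    gpu_cache[key] = cpu_experts[key]
    return key

def decode_one_token(hidden_state):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = []
        for layer in range(NUM_LAYERS):
            # Launch speculative prefetch for the *next* layer first...
            if layer + 1 < NUM_LAYERS:
                keys = predict_next_experts(hidden_state, layer + 1)
                pending = [io.submit(fetch_to_gpu, k) for k in keys]
            # ...then run this layer's compute (a toy update here),
            # so the copy overlaps with computation.
            hidden_state = (hidden_state * 3 + layer) % 1000
            for f in pending:
                f.result()  # ideally the transfer has already finished
    return hidden_state

final = decode_one_token(7)
```

The decoding loop never blocks on a cold expert load as long as the speculation is correct; a mispredicted expert would simply fall back to a synchronous fetch, which is the cost the paper's reliable predictions aim to avoid.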