📄 English Summary
Speculating Experts Accelerates Inference for Mixture-of-Experts
Mixture-of-Experts (MoE) models have become increasingly popular for scaling the capacity of large language models (LLMs) while keeping activations sparse and per-token computation low. However, in memory-constrained inference environments, expert weights must be offloaded to the CPU, and the resulting CPU-GPU transfers become a performance bottleneck during decoding. An expert prefetching scheme is proposed that uses internal representations from the current computation to speculatively predict future experts, allowing memory transfers to overlap with computation. Experiments across several MoE architectures demonstrate that these internal representations reliably predict upcoming experts, and that executing the speculated experts generally preserves model quality.
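The core idea, overlapping speculative CPU-to-GPU expert transfers with the current layer's computation, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's method: the predictor `predict_next_experts`, the toy scoring, and the thread-based "transfer" are all hypothetical stand-ins (a real system would use a learned or router-based predictor and asynchronous device copies).

```python
# Minimal sketch of speculative expert prefetching (illustrative only).
# Assumptions: `predict_next_experts` is a hypothetical predictor, and a
# background thread stands in for an asynchronous CPU->GPU weight copy.
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4
EXPERTS_PER_LAYER = 8
TOP_K = 2

# Expert weights "resident on CPU"; the GPU cache starts empty.
cpu_experts = {(l, e): f"weights[{l}][{e}]"
               for l in range(NUM_LAYERS) for e in range(EXPERTS_PER_LAYER)}
gpu_cache = {}

def predict_next_experts(hidden_state, next_layer):
    # Hypothetical predictor: treat a cheap function of the current hidden
    # state as a proxy for the next layer's router logits, take top-k.
    scores = [(hidden_state * 31 + e) % EXPERTS_PER_LAYER
              for e in range(EXPERTS_PER_LAYER)]
    ranked = sorted(range(EXPERTS_PER_LAYER),
                    key=lambda e: scores[e], reverse=True)
    return [(next_layer, e) for e in ranked[:TOP_K]]

def fetch_to_gpu(key):
    # Stands in for an asynchronous host-to-device memory copy.
    gpu_cache[key] = cpu_experts[key]
    return key

def decode_one_token(hidden_state):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = []
        for layer in range(NUM_LAYERS):
            # Launch speculative prefetch for the *next* layer first...
            if layer + 1 < NUM_LAYERS:
                keys = predict_next_experts(hidden_state, layer + 1)
                pending = [io.submit(fetch_to_gpu, k) for k in keys]
            # ...then run this layer's compute (a toy update here),
            # so the copy overlaps with computation.
            hidden_state = (hidden_state * 3 + layer) % 1000
            for f in pending:
                f.result()  # ideally the transfer has already finished
    return hidden_state

final = decode_one_token(7)
```

The decoding loop never blocks on a cold expert load as long as the speculation is correct; a mispredicted expert would simply fall back to a synchronous fetch, which is the cost the paper's reliable predictions aim to avoid.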