MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
📄 Summary
Mixture-of-Experts (MoE) models offer scalable performance but encounter significant memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic and low-information nature of autoregressive expert activation. This research proposes a novel framework, MoE-SpAc, which repurposes Speculative Decoding (SD) not only as a compute accelerator but also as an informative lookahead sensor for memory management. The framework integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the execution process, thereby enhancing the inference efficiency of MoE models on edge devices.
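The core idea, using speculative draft tokens as a lookahead signal for expert prefetching, can be sketched in a few lines. The class name, the decay constant, and the scoring scheme below are illustrative assumptions for exposition, not the paper's actual API: each draft token's routing decisions add credit to the experts it would use, older observations decay, and the highest-scoring experts become prefetch candidates.

```python
# Hypothetical sketch of the "lookahead sensor" idea: draft tokens produced by
# speculative decoding reveal which experts upcoming tokens are likely to route
# to, so the memory manager can prefetch those experts before they are needed.
# Names and constants here are assumptions, not the MoE-SpAc implementation.
from collections import defaultdict

class SpeculativeUtilityEstimator:
    """Tracks per-expert demand observed from speculative (draft) router outputs."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay               # older observations matter less (assumed policy)
        self.utility = defaultdict(float)

    def observe(self, routed_experts):
        """routed_experts: expert ids the router selected for one draft token."""
        for e in self.utility:           # age all existing utility scores
            self.utility[e] *= self.decay
        for e in routed_experts:         # credit experts the draft token needs
            self.utility[e] += 1.0

    def prefetch_candidates(self, k: int):
        """Top-k experts most worth loading into fast memory next."""
        return sorted(self.utility, key=self.utility.get, reverse=True)[:k]

# Simulated draft pass: four draft tokens, each routed to its top-2 experts.
est = SpeculativeUtilityEstimator()
for routed in [[3, 7], [3, 1], [7, 3], [2, 3]]:
    est.observe(routed)

print(est.prefetch_candidates(2))  # → [3, 7]: expert 3 dominates recent demand
```

In a real system this estimator would feed the workload balancer, which decides (e.g. via the online integer optimization the summary mentions) which of the candidate experts to keep on the accelerator and which to serve from host memory.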
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others