Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

📄 Summary

Mixture-of-Experts (MoE) models scale capacity efficiently, but their large parameter footprint creates a memory bottleneck at deployment time. This work organizes retraining-free MoE compression into three paradigms: Expert Pruning, Expert Editing, and Expert Merging. It shows that persistent post-compression degradation largely stems from a neglected factor: the mismatch between the router and the experts that arises when experts are altered while the router is left unchanged. Effective retraining-free compression should therefore avoid updating expert parameters while still allowing lightweight router calibration. To this end, Router Knowledge Distillation (Router KD) is proposed: it updates only a small fraction of parameters (the router) by distilling next-token information from the original model.
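The router-calibration idea can be sketched on a toy dense-gated MoE layer: compress the experts (here, naively averaging two of them as a stand-in for expert merging), freeze all expert and output parameters, and train only the router by minimizing the KL divergence between the compressed model's next-token distribution and the original model's. This is a minimal illustrative sketch under assumed module names and sizes, not the paper's actual Router KD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy dense-gated MoE block with a small LM head (illustrative only)."""
    def __init__(self, d=16, n_experts=4, vocab=32):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, x):
        gates = F.softmax(self.router(x), dim=-1)              # (B, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, d, E)
        h = (outs * gates.unsqueeze(1)).sum(-1)                # gate-weighted mix
        return self.lm_head(h)

torch.manual_seed(0)
teacher = TinyMoE()                       # original (uncompressed) model
student = TinyMoE()
student.load_state_dict(teacher.state_dict())

# Simulate a compression step: merge experts 0 and 1 by weight averaging.
with torch.no_grad():
    w = 0.5 * (student.experts[0].weight + student.experts[1].weight)
    student.experts[0].weight.copy_(w)
    student.experts[1].weight.copy_(w)

# Freeze everything except the router (the only calibrated parameters).
for name, p in student.named_parameters():
    p.requires_grad = name.startswith("router")
frozen_expert = student.experts[0].weight.clone()

opt = torch.optim.Adam([p for p in student.parameters() if p.requires_grad], lr=1e-2)
x = torch.randn(8, 16)                    # stand-in for calibration activations
with torch.no_grad():
    t_logp = F.log_softmax(teacher(x), dim=-1)

router_before = student.router.weight.clone()
for _ in range(50):
    s_logp = F.log_softmax(student(x), dim=-1)
    # KL(teacher || student) over next-token distributions
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After the loop, only the router's weights have moved; the merged expert weights are byte-identical to their post-compression values, matching the constraint that retraining-free compression leaves expert parameters untouched.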
