MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

📄 Summary

The deployment of elastic large language models (LLMs) has become increasingly important as runtime complexity varies across cloud and edge devices. An elastic LLM can run inference at different quantization precisions depending on the computational resources available. However, the calibration parameters used for quantization are typically tied to a specific precision, which complicates both elastic-precision calibration and runtime precision switching. This study attributes the variation in calibration parameters to token-level sensitivity changes caused by a precision-dependent outlier migration phenomenon. Motivated by this observation, the authors propose MoBiQuant, a novel Mixture-of-Bits quantization framework that dynamically adjusts weight precision according to per-token sensitivity, thereby improving both the performance and the adaptability of elastic LLMs.
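The abstract does not specify how token sensitivity is measured or how bit-widths are assigned, so the following is only a minimal sketch of the general idea of token-adaptive mixed-bit inference, not the paper's algorithm. The sensitivity proxy (an outlier score based on peak-to-mean activation magnitude), the threshold, and all function names are assumptions for illustration:

```python
import numpy as np

def quantize_sym(w, bits):
    # Symmetric uniform quantization of a weight matrix to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def token_sensitivity(x):
    # Hypothetical sensitivity proxy: ratio of a token's largest activation
    # magnitude to its mean magnitude, so outlier-heavy tokens score high.
    mag = np.abs(x)
    return mag.max(axis=-1) / (mag.mean(axis=-1) + 1e-8)

def mixture_of_bits_matmul(x, w, bit_choices=(4, 8), threshold=4.0):
    # Route each token: outlier-heavy (sensitive) tokens multiply against
    # the higher-precision weights, the rest use the low-bit weights.
    w_lo = quantize_sym(w, bit_choices[0])
    w_hi = quantize_sym(w, bit_choices[1])
    sens = token_sensitivity(x)                      # shape: (num_tokens,)
    out = np.where(sens[:, None] > threshold, x @ w_hi, x @ w_lo)
    return out, sens

# Two tokens: token 1 carries a large activation outlier.
x = np.ones((2, 8))
x[1, 0] = 50.0
w = np.random.default_rng(0).normal(size=(8, 4))
out, sens = mixture_of_bits_matmul(x, w)
```

In this sketch the per-token routing is what makes the scheme "elastic": shifting the threshold trades accuracy for the fraction of tokens served at the cheaper bit-width, without recalibrating the weights.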
