MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

📄 Summary

The deployment of elastic large language models (LLMs) has become increasingly important as runtime complexity varies across cloud and edge devices. An elastic LLM can run inference at different quantization precisions depending on the computational resources available. However, the calibration parameters used for quantization are typically tied to a specific precision, which complicates both elastic-precision calibration and runtime precision switching. This study attributes the variation in calibration parameters to token-level sensitivity changes caused by a precision-dependent outlier migration phenomenon. Motivated by this observation, the authors propose MoBiQuant, a novel Mixture-of-Bits quantization framework that dynamically adjusts weight precision according to per-token sensitivity, thereby improving both the performance and the adaptability of elastic LLMs.
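The abstract does not specify how token sensitivity is measured or how bit-widths are assigned, so the following is only a minimal sketch of the general idea of token-adaptive mixed-bit inference, not the paper's algorithm. The sensitivity proxy (an outlier score based on peak-to-mean activation magnitude), the threshold, and all function names are assumptions for illustration:

```python
import numpy as np

def quantize_sym(w, bits):
    # Symmetric uniform quantization of a weight matrix to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def token_sensitivity(x):
    # Hypothetical sensitivity proxy: ratio of a token's largest activation
    # magnitude to its mean magnitude, so outlier-heavy tokens score high.
    mag = np.abs(x)
    return mag.max(axis=-1) / (mag.mean(axis=-1) + 1e-8)

def mixture_of_bits_matmul(x, w, bit_choices=(4, 8), threshold=4.0):
    # Route each token: outlier-heavy (sensitive) tokens multiply against
    # the higher-precision weights, the rest use the low-bit weights.
    w_lo = quantize_sym(w, bit_choices[0])
    w_hi = quantize_sym(w, bit_choices[1])
    sens = token_sensitivity(x)                      # shape: (num_tokens,)
    out = np.where(sens[:, None] > threshold, x @ w_hi, x @ w_lo)
    return out, sens

# Two tokens: token 1 carries a large activation outlier.
x = np.ones((2, 8))
x[1, 0] = 50.0
w = np.random.default_rng(0).normal(size=(8, 4))
out, sens = mixture_of_bits_matmul(x, w)
```

In this sketch the per-token routing is what makes the scheme "elastic": shifting the threshold trades accuracy for the fraction of tokens served at the cheaper bit-width, without recalibrating the weights.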
