TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
📄 Summary
To address the heavy computational demands of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, these methods rely heavily on calibration data and can therefore suffer from domain shift on unseen downstream tasks. To mitigate this problem, a test-time quantization (TTQ) framework is proposed that compresses large models on the fly during inference. Through efficient online calibration, TTQ performs instant activation-aware quantization for every prompt, delivering inference speedups regardless of the downstream task. Experiments show that TTQ outperforms state-of-the-art baselines in quantization performance.
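The summary does not specify TTQ's actual algorithm, so the sketch below only illustrates the general activation-aware recipe it builds on (in the spirit of AWQ/SmoothQuant), with the calibration statistics taken from the current prompt rather than an offline dataset. Everything here, including the function name `online_calibrate` and the choice of 4-bit symmetric quantization, is an illustrative assumption, not the paper's API.

```python
import torch

@torch.no_grad()
def online_calibrate(linear: torch.nn.Linear, x: torch.Tensor, n_bits: int = 4) -> None:
    """Simulated per-prompt activation-aware quantization of one linear layer.

    Illustrative sketch only: TTQ's actual procedure is not described in the summary.
    x holds the current prompt's activations, shape (tokens, in_features).
    Input channels with large activations receive relatively finer weight
    resolution, mirroring the activation-aware recipe of AWQ/SmoothQuant.
    """
    # Online calibration: per-input-channel activation scale from this prompt.
    act_scale = x.abs().mean(dim=0).clamp(min=1e-5)              # (in_features,)

    # Fold activation importance into the weights before quantizing.
    w = linear.weight.data * act_scale                           # (out, in)

    # Symmetric round-to-nearest quantization, one scale per output channel.
    q_max = 2 ** (n_bits - 1) - 1
    w_scale = (w.abs().amax(dim=1, keepdim=True) / q_max).clamp(min=1e-8)
    w_q = torch.clamp(torch.round(w / w_scale), -q_max - 1, q_max)

    # Dequantize and undo the activation scaling (fake-quant simulation).
    linear.weight.data = w_q * w_scale / act_scale

if __name__ == "__main__":
    layer = torch.nn.Linear(4096, 4096)
    prompt_acts = torch.randn(128, 4096)  # stands in for activations recorded from the prompt
    online_calibrate(layer, prompt_acts, n_bits=4)
```

In a real deployment the activation statistics would come from a cheap forward pass over the incoming prompt, so the calibration cost is amortized over the subsequent generation; whether TTQ quantizes weights, activations, or both is not stated in this summary.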