TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
📄 Summary
To address the heavy computational demands of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, these methods rely heavily on calibration data and can therefore suffer from domain shift on unseen downstream tasks. To mitigate this problem, a test-time quantization (TTQ) framework is proposed that compresses large models on the fly during inference. Through efficient online calibration, TTQ performs instant activation-aware quantization for every prompt, delivering inference speedups regardless of the downstream task. Experiments show that TTQ outperforms state-of-the-art baselines in quantization performance.
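The summary does not specify TTQ's actual algorithm, so the sketch below only illustrates the general activation-aware recipe it builds on (in the spirit of AWQ/SmoothQuant), with the calibration statistics taken from the current prompt rather than an offline dataset. Everything here, including the function name `online_calibrate` and the choice of 4-bit symmetric quantization, is an illustrative assumption, not the paper's API.

```python
import torch

@torch.no_grad()
def online_calibrate(linear: torch.nn.Linear, x: torch.Tensor, n_bits: int = 4) -> None:
    """Simulated per-prompt activation-aware quantization of one linear layer.

    Illustrative sketch only: TTQ's actual procedure is not described in the summary.
    x holds the current prompt's activations, shape (tokens, in_features).
    Input channels with large activations receive relatively finer weight
    resolution, mirroring the activation-aware recipe of AWQ/SmoothQuant.
    """
    # Online calibration: per-input-channel activation scale from this prompt.
    act_scale = x.abs().mean(dim=0).clamp(min=1e-5)              # (in_features,)

    # Fold activation importance into the weights before quantizing.
    w = linear.weight.data * act_scale                           # (out, in)

    # Symmetric round-to-nearest quantization, one scale per output channel.
    q_max = 2 ** (n_bits - 1) - 1
    w_scale = (w.abs().amax(dim=1, keepdim=True) / q_max).clamp(min=1e-8)
    w_q = torch.clamp(torch.round(w / w_scale), -q_max - 1, q_max)

    # Dequantize and undo the activation scaling (fake-quant simulation).
    linear.weight.data = w_q * w_scale / act_scale

if __name__ == "__main__":
    layer = torch.nn.Linear(4096, 4096)
    prompt_acts = torch.randn(128, 4096)  # stands in for activations recorded from the prompt
    online_calibrate(layer, prompt_acts, n_bits=4)
```

In a real deployment the activation statistics would come from a cheap forward pass over the incoming prompt, so the calibration cost is amortized over the subsequent generation; whether TTQ quantizes weights, activations, or both is not stated in this summary.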