Attn-QAT: Attention With 4-Bit Quantization-Aware Training

📄 Abstract (translated from Chinese)

Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle because of FP4's extremely narrow dynamic range and the heavy-tailed activations of the attention mechanism. This work presents the first systematic study of 4-bit quantization-aware training (QAT) targeted at attention. It finds that a naive "drop-in" QAT scheme, which pairs an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. Two key principles for stabilizing FP4 attention are identified: (1) recomputing attention scores in the backward pass at matching low precision, and (2) addressing the implicit precision assumptions in FA's gradient computation. Based on these insights, corresponding solutions are proposed.

📄 English Summary

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Achieving reliable 4-bit attention is essential for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's limited dynamic range and the heavy-tailed activations of the attention mechanism. This research presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. The study finds that a naive "drop-in" QAT approach, which combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. Two key principles for stable FP4 attention are identified: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) addressing implicit precision assumptions in FA's gradient calculation. Based on these insights, corresponding solutions are proposed.
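For background, QAT typically simulates low-precision arithmetic with "fake quantization": values are rounded to the nearest representable FP4 number in the forward pass, while gradients flow through unchanged (a straight-through estimator). The summary above does not include code, so the sketch below is purely illustrative; the function name and per-tensor scaling scheme are assumptions, though the eight magnitudes are the standard FP4 E2M1 levels.

```python
# Illustrative sketch (not from the paper): per-tensor FP4 E2M1 fake quantization.

# The 8 non-negative magnitudes representable in FP4 E2M1.
FP4_E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0

def fake_quant_fp4(xs):
    """Fake-quantize a list of floats to FP4 E2M1 with per-tensor scaling."""
    # Scale so the largest magnitude maps to FP4's max representable value.
    amax = max(abs(x) for x in xs) or 1.0
    scale = amax / FP4_MAX
    out = []
    for x in xs:
        v = abs(x) / scale
        # Round to the nearest representable FP4 magnitude.
        q = min(FP4_E2M1_LEVELS, key=lambda lvl: abs(lvl - v))
        out.append(q * scale if x >= 0 else -q * scale)
    return out
```

Note how per-tensor scaling anchors the largest magnitude at 6.0: with heavy-tailed attention activations, a single outlier dominates the scale and crushes the remaining values into a handful of coarse levels, which is one intuition for why attention is the hard case for FP4.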
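Principle (1) can be illustrated with a toy example: if the backward pass recomputes attention scores at full precision while the forward pass used quantized scores, the probabilities entering the gradient no longer match the ones that produced the output. A minimal sketch, where a coarse 0.5-step rounding stands in for FP4 quantization:

```python
# Illustrative sketch: mismatched vs. matched score recomputation in the backward pass.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def quantize(x, step=0.5):
    # Crude stand-in for FP4 rounding of attention logits.
    return round(x / step) * step

logits = [0.30, 1.20, 2.75]
q_logits = [quantize(x) for x in logits]

p_forward = softmax(q_logits)    # probabilities actually used in the forward pass
p_recomp_hi = softmax(logits)    # naive backward: full-precision recomputation
p_recomp_lo = softmax(q_logits)  # matched backward: same quantized logits

# The naive recomputation disagrees with what the forward pass produced;
# the matched recomputation reproduces it exactly.
mismatch = max(abs(a - b) for a, b in zip(p_forward, p_recomp_hi))
```

The matched recomputation (`p_recomp_lo`) is bitwise identical to the forward probabilities, while the full-precision recomputation drifts; accumulated over many steps, that drift is one plausible source of the training instability the study reports for "drop-in" QAT.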
