FlashAttention-T: Towards Tensorized Attention

Source: FlashAttention-T: Towards Tensorized Attention

Published: February 3, 2026

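📄 Summary (translated from Chinese)

FlashAttention-T: Towards Tensorized Attention is a major innovation in the classic FlashAttention series, proposed by Tri Dao and other researchers and published at an ACM conference (DOI: 10.1145/3774934.3786425). Conventional Transformer attention suffers from quadratic computational complexity and high memory-bandwidth demand. For long sequences in particular (e.g., millions of tokens in LLMs), HBM (High Bandwidth Memory) access becomes the bottleneck, making training and inference inefficient.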
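To make the quadratic growth concrete, the short sketch below computes how much memory a single, fully materialized N×N score matrix would need. It assumes 2 bytes per element (FP16/BF16) and one head of one sequence; the function name `attention_matrix_bytes` is illustrative and not taken from the paper.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to store one full seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_element


# Assumed example lengths: a short context, a long context, one million tokens.
for n in (4_096, 131_072, 1_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N = {n:>9,}: {gib:,.2f} GiB per head")
```

At one million tokens this comes to 2 × 10^12 bytes (about 2 TB) for a single head, far beyond the HBM capacity of any single GPU, which is why the matrix cannot simply be stored and re-read.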

📄 English Summary


[FlashAttention-T: Towards Tensorized Attention](https://dl.acm.org/doi/10.1145/3774934.3786425)

FlashAttention-T, introduced by Tri Dao and colleagues (ACM DOI: 10.1145/3774934.3786425), is a major step in the FlashAttention lineage, pushing Transformer attention toward fully hardware-native tensorization. Standard scaled dot-product attention costs O(N²) time and memory in the sequence length N, which bottlenecks long-context models (e.g., LLMs with millions of tokens) through excessive HBM (High Bandwidth Memory) traffic.

Prior FlashAttention iterations addressed this with IO-aware tiling: Q, K, and V are partitioned into blocks that fit in SRAM, softmax is computed online so that the full N×N attention matrix is never materialized, and the attention matrix is recomputed during backpropagation rather than stored. Together with kernel fusion, this reduces memory to O(N) and yields 2-4x speedups.

The key innovation of FlashAttention-T is to tensorize the entire attention kernel so that it runs on modern GPU Tensor Cores (e.g., NVIDIA Hopper/Blackwell), expressing all operations, including the nonlinear softmax, as high-throughput tensor matrix multiply-accumulate (MMA) primitives. Core technical contributions include:

1) **Tensorized Softmax**: Conventional per-element exp/logsumexp is replaced by block-wise approximations expressed as tensor GEMMs, using either lookup tables or low-order Taylor expansions for exp; causal masking is applied exactly via tensor broadcasting. This allows softmax(QK^T / √d) to be computed in a single MMA call.
2) **Multi-Head Tensorization**: Multi-head attention (MHA) is lifted to higher-order tensors (e.g., [batch, heads, seq_len/block_len]) so that all heads are processed in one MMA invocation, reportedly cutting kernel launch overhead by 10x compared with batched serial execution.
3) **Low-Precision MMA Fusion**: Native FP8/INT4 support with block-floating-point scaling preserves accuracy while exploiting peak Tensor Core throughput (e.g., roughly 1 PFLOPS of dense FP16 on H100).
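The tiling and online-softmax idea attributed above to earlier FlashAttention versions can be illustrated with a minimal NumPy sketch. It is not the FlashAttention-T kernel and makes simplifying assumptions: only K and V are tiled (real kernels also tile Q), there is no masking, and everything runs in float64 on the CPU. The names `tiled_attention`, `reference_attention`, and `block_size` are hypothetical.

```python
import numpy as np


def tiled_attention(q, k, v, block_size=64):
    """Exact softmax(q k^T / sqrt(d)) v, visiting K and V one block at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))   # unnormalised output accumulator
    row_max = np.full(n, -np.inf)     # running maximum of each score row
    row_sum = np.zeros(n)             # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale            # only an n x block_size tile

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescales earlier statistics
        p = np.exp(scores - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]


def reference_attention(q, k, v):
    """Dense baseline that materialises the full score matrix."""
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v
```

Because the running maximum and denominator are rescaled whenever a new block raises the maximum, the result matches the dense baseline up to floating-point rounding, while only an n × block_size tile of scores exists at any time.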
The backward pass employs 'tensorized recomputation,' recomputing only the tiled tensors on the fly.
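As a rough illustration of recomputation in the backward pass, the NumPy sketch below follows the standard FlashAttention-style approach rather than the paper's tensorized variant, whose details are not given here. The forward pass keeps only the output and a per-row log-sum-exp; the backward pass rebuilds each probability tile from Q, K, and that statistic. The function names and `block_size` are assumptions, and the forward pass is written densely for brevity.

```python
import numpy as np


def forward(q, k, v):
    """Return the output and per-row log-sum-exp (dense here for brevity)."""
    s = (q @ k.T) / np.sqrt(q.shape[1])
    m = s.max(axis=1)
    lse = m + np.log(np.exp(s - m[:, None]).sum(axis=1))
    out = np.exp(s - lse[:, None]) @ v   # probabilities are not returned
    return out, lse


def backward_recompute(q, k, v, out, lse, d_out, block_size=64):
    """Gradients w.r.t. q, k, v, rebuilding each probability tile on the fly."""
    scale = 1.0 / np.sqrt(q.shape[1])
    dq, dk, dv = np.zeros_like(q), np.zeros_like(k), np.zeros_like(v)
    delta = (d_out * out).sum(axis=1)    # rowsum(dO * O), one value per query

    for start in range(0, k.shape[0], block_size):
        blk = slice(start, start + block_size)
        s = (q @ k[blk].T) * scale       # recomputed score tile
        p = np.exp(s - lse[:, None])     # recomputed probability tile
        dv[blk] = p.T @ d_out
        dp = d_out @ v[blk].T
        ds = p * (dp - delta[:, None])   # softmax Jacobian applied row-wise
        dq += (ds @ k[blk]) * scale
        dk[blk] = (ds.T @ q) * scale
    return dq, dk, dv
```

The only per-row state carried from the forward pass is `out` and `lse`, so the extra memory stays linear in the sequence length; the cost is one additional score computation per tile in the backward pass.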
