VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Source: VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Published: February 2, 2026

📄 Summary (translated from Chinese)

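In Video Diffusion Transformers (Video DiTs), the quadratic complexity of standard attention severely limits context scalability. Analysis shows that the sparse spatio-temporal attention patterns in Video DiTs can be represented naturally by Monarch matrices, a class of structured matrices with flexible sparsity that enables sub-quadratic attention computation via an alternating minimization algorithm. Building on this finding, the paper proposes VMonarch, a new video diffusion model architecture that exploits the properties of Monarch matrices to address the attention efficiency bottleneck in existing Video DiTs. By decomposing the attention operation into a series of smaller, more efficient matrix multiplications, VMonarch significantly reduces computational cost while maintaining or improving model performance.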
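To make "sub-quadratic" concrete, the following is a worked cost estimate under the standard Monarch factorization from the earlier Monarch-matrix literature. It is an illustrative assumption: the summary does not state the exact parameterization VMonarch uses, and the symbols N, m, P, L, R are introduced here for illustration only.

```latex
% Assumed setting: sequence length N = m^2.
% L and R are block-diagonal with m blocks of size m x m each;
% P is the permutation that transposes an m x m grid of indices.
\[
  M = P \, L \, P^{\top} R,
  \qquad
  L = \operatorname{diag}(L_1, \dots, L_m),
  \quad
  R = \operatorname{diag}(R_1, \dots, R_m),
  \quad
  L_i, R_i \in \mathbb{R}^{m \times m}.
\]
% Each block-diagonal factor costs m * m^2 = m^3 multiply-adds per matrix-vector product.
\[
  \operatorname{cost}(Mx) = 2m^3 = 2N^{3/2}
  \quad \text{versus} \quad
  \operatorname{cost}_{\text{dense}} = N^2 .
\]
% Example: N = 4096, m = 64.
\[
  2 \cdot 64^3 = 524{,}288
  \quad \text{versus} \quad
  4096^2 = 16{,}777{,}216,
  \qquad \text{a ratio of } m/2 = 32 .
\]
```

Under this assumption the saving grows with the square root of the sequence length, which is why the structure becomes more attractive as video token counts increase.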

📄 English Summary


The quadratic complexity of attention limits the context scalability of Video Diffusion Transformers (Video DiTs). The authors observe that the highly sparse spatio-temporal attention patterns found in Video DiTs can be represented well by Monarch matrices, a class of structured matrices with flexible sparsity that admits sub-quadratic attention computation through an alternating minimization algorithm. Building on this observation, they propose VMonarch, an architecture for video diffusion models that uses Monarch matrices to remove the attention efficiency bottleneck of existing Video DiTs. By decomposing the attention operation into a series of smaller, cheaper matrix multiplications, VMonarch substantially reduces computational cost while maintaining or improving model performance.

The structure of Monarch matrices yields sparse attention weights without discarding critical information, which makes long video sequences tractable. The design lets the model capture both local and global dependencies in video data while avoiding the cost that full self-attention incurs on high-resolution or long-duration videos. This improves training and inference efficiency and makes larger, higher-quality video generation models more practical. The authors report gains in both efficiency and generation quality across a range of video generation tasks.
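The phrase "a series of smaller, cheaper matrix multiplications" can be illustrated with a minimal NumPy sketch of a Monarch matrix-vector product. It assumes the common factorization M = P L Pᵀ R with block-diagonal L and R and a grid-transpose permutation P. The function names (`monarch_matvec`, `monarch_dense`) and the block size are hypothetical, and the sketch does not reproduce VMonarch's attention or its alternating minimization step, which the summary does not detail.

```python
import numpy as np


def monarch_matvec(L_blocks, R_blocks, x):
    """Compute y = P L P^T R x without forming the N x N matrix.

    L_blocks, R_blocks: arrays of shape (m, m, m), i.e. m blocks of size m x m.
    x: vector of length N = m * m.
    """
    m = L_blocks.shape[0]
    X = x.reshape(m, m)                        # row k is the k-th chunk of x
    Y = np.einsum("kij,kj->ki", R_blocks, X)   # block-diagonal R: m small matvecs
    Y = Y.T                                    # P^T: transpose the m x m grid
    Z = np.einsum("kij,kj->ki", L_blocks, Y)   # block-diagonal L: m small matvecs
    return Z.T.reshape(-1)                     # P: transpose back and flatten


def block_diag(blocks):
    """Assemble a dense block-diagonal matrix (used for verification only)."""
    k, b, _ = blocks.shape
    D = np.zeros((k * b, k * b))
    for i in range(k):
        D[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
    return D


def monarch_dense(L_blocks, R_blocks):
    """Materialize the full Monarch matrix M = P L P^T R (verification only)."""
    m = L_blocks.shape[0]
    n = m * m
    idx = np.arange(n).reshape(m, m).T.reshape(-1)
    P = np.zeros((n, n))
    P[np.arange(n), idx] = 1.0                 # P @ v == v.reshape(m, m).T.reshape(-1)
    return P @ block_diag(L_blocks) @ P.T @ block_diag(R_blocks)


rng = np.random.default_rng(0)
m = 8                                          # illustrative block size, N = 64
L_blocks = rng.standard_normal((m, m, m))
R_blocks = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)

y_fast = monarch_matvec(L_blocks, R_blocks, x)
y_dense = monarch_dense(L_blocks, R_blocks) @ x
print("structured and dense results agree:", np.allclose(y_fast, y_dense))
print("multiply-adds, structured vs dense:", 2 * m**3, "vs", (m * m) ** 2)
```

The structured path touches only the 2m³ block entries, while the dense path needs all N² = m⁴ entries. The dense matrix is built here purely to check the result and would not be formed in practice.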


