Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

📄 English Summary

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

This paper introduces the “Momentum Attention” framework, which integrates the Transformer attention mechanism with the physical concept of momentum to better understand in-context learning (ICL). The authors find that during ICL, Transformers simulate particle motion in a potential field via attention, with the query, key, and value vectors corresponding to a particle's position, momentum, and energy, respectively. This physical analogy illuminates the intrinsic dynamics of ICL, explaining how models iteratively update their internal states to learn and apply new patterns. The paper further introduces “spectral forensics,” a method that quantifies and visualizes model learning behavior across tasks and data distributions by analyzing the spectral properties of attention matrices. Spectral forensics identifies critical computational patterns and information flow within the model, offering a new tool for mechanistic interpretability. Experiments show that the Momentum Attention framework not only explains Transformers' ICL behavior but also reveals, through spectral analysis, functional differentiation across layers and attention heads. This work offers a physics-based perspective on the emergent capabilities of large language models and lays a foundation for designing more interpretable and efficient models.
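The summary does not specify which spectral quantities the paper's forensics uses, but the general idea can be sketched. The snippet below is an illustrative example, not the paper's method: it builds standard attention weights from random query/key matrices, then summarizes the attention matrix by its singular values and their spectral entropy (a common, generic measure of how concentrated a matrix's modes are). The function names `attention_matrix` and `spectral_summary`, and the choice of entropy as the diagnostic, are assumptions for illustration.

```python
import numpy as np

def attention_matrix(Q, K):
    """Row-wise softmax of scaled dot-product scores (standard attention weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)      # each row sums to 1

def spectral_summary(A):
    """Singular values of A and the entropy of the normalized squared spectrum.

    Low entropy means a few dominant spectral modes carry most of the
    matrix's action -- one generic stand-in for 'spectral forensics'.
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    p = s**2 / np.sum(s**2)                 # normalized spectral distribution
    entropy = -np.sum(p * np.log(p + 1e-12))
    return s, entropy

rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 32))  # 16 tokens, head dimension 32
K = rng.standard_normal((16, 32))
A = attention_matrix(Q, K)
s, H = spectral_summary(A)
```

Comparing such spectral summaries across layers and heads is one plausible way to quantify the functional differentiation the paper reports, e.g. heads whose attention matrices are nearly rank-one versus heads with a flatter spectrum.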

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others