Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

📄 Chinese Abstract

In mechanistic interpretability research, recent work has analyzed transformer "circuits" in depth: sparse, single- or multi-layer sub-computations that may reflect human-understandable functions. However, the stability of these circuits across different instances of the same deep-learning architecture has rarely been rigorously examined. Without this validation, it remains unclear whether reported circuits appear universally across labs or are specific to particular trained instances, which may limit confidence in safety-critical settings. This study systematically investigates stability in transformer language models of varying sizes, quantifying layer by layer how similarly attention heads learn representations across independent refits.

📄 English Summary

Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

This study systematically investigates the stability of transformer circuits, the sparse, single- or multi-layer sub-computations studied in mechanistic interpretability that may reflect human-understandable functions. A significant gap remains in assessing whether these circuits are stable across different instances of the same deep-learning architecture. Without rigorous testing, it is uncertain whether reported circuits are universal across labs or specific to particular estimation instances, which could undermine confidence in safety-critical applications. The research quantifies, layer by layer, the degree to which attention heads learn similar representations across independent refits of transformer language models of varying sizes.
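The layer-by-layer comparison described above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual method: the function names (`head_similarity`, `layerwise_stability`) and the choice of Pearson correlation over flattened attention maps are assumptions for demonstration; published analyses often use richer measures such as CKA. Because attention heads within a layer carry no canonical ordering across refits, each head in one model is greedily matched to its most similar head in the other before averaging.

```python
import numpy as np

def head_similarity(attn_a, attn_b):
    """Pearson correlation between two heads' flattened attention maps."""
    a, b = attn_a.ravel(), attn_b.ravel()
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def layerwise_stability(model_a, model_b):
    """model_*: list over layers of arrays shaped (heads, seq, seq).

    For each layer, match every head in model_a to its most similar head
    in model_b (heads are permutation-symmetric across refits) and return
    the mean matched similarity per layer.
    """
    scores = []
    for layer_a, layer_b in zip(model_a, model_b):
        sims = np.array([[head_similarity(ha, hb) for hb in layer_b]
                         for ha in layer_a])
        scores.append(float(sims.max(axis=1).mean()))
    return scores

# Toy demo with synthetic attention maps: 2 layers, 4 heads, seq length 8.
# A "refit" is simulated by adding small noise to the base maps.
rng = np.random.default_rng(0)
base = [rng.random((4, 8, 8)) for _ in range(2)]
refit = [layer + 0.05 * rng.random(layer.shape) for layer in base]
print(layerwise_stability(base, refit))  # one score per layer, near 1.0
```

In a real analysis, the attention maps would come from hooking the attention modules of two independently trained checkpoints on the same input batch; a sharp drop in the per-layer score would flag layers whose heads are not reproducible across refits.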


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others