Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

📄 Chinese Abstract

In mechanistic interpretability research, recent work has analyzed transformer "circuits" in depth: sparse, single- or multi-layer sub-computations that may reflect human-understandable functions. However, the stability of these circuits across different instances of the same deep-learning architecture has rarely been rigorously examined. Without this validation, it remains unclear whether reported circuits appear universally across labs or are specific to particular trained instances, which may limit confidence in safety-critical settings. This study systematically investigates stability in transformer language models of varying sizes, quantifying layer by layer how similarly attention heads learn representations across independent refits.

📄 English Summary

Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

This study systematically investigates the stability of transformer circuits, the sparse, single- or multi-layer sub-computations studied in mechanistic interpretability that may reflect human-understandable functions. A significant gap remains in assessing whether these circuits are stable across different instances of the same deep-learning architecture. Without rigorous testing, it is uncertain whether reported circuits are universal across labs or specific to particular estimation instances, which could undermine confidence in safety-critical applications. The research quantifies, layer by layer, the degree to which attention heads learn similar representations across independent refits of transformer language models of varying sizes.
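The layer-by-layer comparison described above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual method: the function names (`head_similarity`, `layerwise_stability`) and the choice of Pearson correlation over flattened attention maps are assumptions for demonstration; published analyses often use richer measures such as CKA. Because attention heads within a layer carry no canonical ordering across refits, each head in one model is greedily matched to its most similar head in the other before averaging.

```python
import numpy as np

def head_similarity(attn_a, attn_b):
    """Pearson correlation between two heads' flattened attention maps."""
    a, b = attn_a.ravel(), attn_b.ravel()
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def layerwise_stability(model_a, model_b):
    """model_*: list over layers of arrays shaped (heads, seq, seq).

    For each layer, match every head in model_a to its most similar head
    in model_b (heads are permutation-symmetric across refits) and return
    the mean matched similarity per layer.
    """
    scores = []
    for layer_a, layer_b in zip(model_a, model_b):
        sims = np.array([[head_similarity(ha, hb) for hb in layer_b]
                         for ha in layer_a])
        scores.append(float(sims.max(axis=1).mean()))
    return scores

# Toy demo with synthetic attention maps: 2 layers, 4 heads, seq length 8.
# A "refit" is simulated by adding small noise to the base maps.
rng = np.random.default_rng(0)
base = [rng.random((4, 8, 8)) for _ in range(2)]
refit = [layer + 0.05 * rng.random(layer.shape) for layer in base]
print(layerwise_stability(base, refit))  # one score per layer, near 1.0
```

In a real analysis, the attention maps would come from hooking the attention modules of two independently trained checkpoints on the same input batch; a sharp drop in the per-layer score would flag layers whose heads are not reproducible across refits.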


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others