📄 Chinese Summary
Transformer models resist fine-grained control: ablating attention heads identified as critical produces minimal behavioral change because distributed redundancy compensates for the loss. This "Hydra effect" makes interpretability illusory: components can be identified through correlation, but their causal roles can be neither predicted nor controlled. The work shows that architectural interventions can expose hidden modularity. The method combines dual-stream processing that separates token and contextual representations, per-layer supervision that provides an independent gradient signal at each depth, and gated attention regularized toward discrete activation patterns. After training with per-layer supervision, ablation effects are 5 to 23 times larger.
📄 English Summary
Engineering Verifiable Modularity in Transformers via Per-Layer Supervision
Transformers resist surgical control: ablating an attention head deemed critical for a specific task often produces minimal behavioral change, because distributed redundancy compensates for the damage. This Hydra effect renders interpretability illusory, since components can be identified through correlation but their causal roles remain unpredictable and uncontrollable. The work demonstrates that architectural interventions can reveal hidden modularity. The proposed approach integrates dual-stream processing that separates token and contextual representations, per-layer supervision that provides an independent gradient signal at each depth, and gated attention regularized toward discrete activation patterns. Models trained with per-layer supervision exhibit ablation effects that are 5 to 23 times larger.
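To make the ingredients concrete, the sketch below is a minimal, hypothetical PyTorch rendering of two of them: a per-head sigmoid gate regularized toward 0/1 activations, and an auxiliary probe at every layer that supplies an independent gradient signal. All class names, the g·(1−g) gate penalty, and the toy ablation check are illustrative assumptions rather than the authors' implementation, and the dual-stream token/context separation is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSelfAttention(nn.Module):
    """Multi-head self-attention with a learned sigmoid gate on each head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))  # one gate per head

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.d_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        heads = attn.softmax(dim=-1) @ v                        # (B, H, T, d_head)
        gates = torch.sigmoid(self.gate_logits)                 # (H,)
        heads = heads * gates.view(1, -1, 1, 1)                 # gate each head's output
        out = heads.transpose(1, 2).reshape(B, T, D)
        return self.proj(out), gates


class SupervisedBlock(nn.Module):
    """Transformer block with an auxiliary probe so each layer gets its own loss."""

    def __init__(self, d_model: int, n_heads: int, n_classes: int):
        super().__init__()
        self.attn = GatedSelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.probe = nn.Linear(d_model, n_classes)              # per-layer supervision head

    def forward(self, x):
        a, gates = self.attn(self.norm1(x))
        x = x + a
        x = x + self.ff(self.norm2(x))
        logits = self.probe(x.mean(dim=1))                      # pooled per-layer prediction
        return x, logits, gates


class PerLayerSupervisedModel(nn.Module):
    def __init__(self, vocab=100, d_model=64, n_heads=4, n_layers=3, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            SupervisedBlock(d_model, n_heads, n_classes) for _ in range(n_layers))

    def forward(self, tokens):
        x = self.embed(tokens)
        all_logits, all_gates = [], []
        for block in self.blocks:
            x, logits, gates = block(x)
            all_logits.append(logits)
            all_gates.append(gates)
        return all_logits, all_gates


def training_loss(all_logits, all_gates, labels, gate_weight=0.1):
    # Independent gradient signal at every depth + a penalty pushing gates toward 0/1.
    task = sum(F.cross_entropy(l, labels) for l in all_logits)
    gate_reg = sum((g * (1 - g)).sum() for g in all_gates)      # minimized at g in {0, 1}
    return task + gate_weight * gate_reg


if __name__ == "__main__":
    torch.manual_seed(0)
    model = PerLayerSupervisedModel()
    tokens = torch.randint(0, 100, (8, 16))
    labels = torch.randint(0, 10, (8,))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(3):                                          # a few toy training steps
        all_logits, all_gates = model(tokens)
        loss = training_loss(all_logits, all_gates, labels)
        opt.zero_grad(); loss.backward(); opt.step()

    # Crude ablation check: silence head 0 of layer 0 and compare the final-layer loss.
    with torch.no_grad():
        base = F.cross_entropy(model(tokens)[0][-1], labels).item()
        model.blocks[0].attn.gate_logits[0] = -1e4              # drive the gate to ~0
        ablated = F.cross_entropy(model(tokens)[0][-1], labels).item()
    print(f"loss before ablation {base:.3f}, after {ablated:.3f}")
```

Under this setup, silencing a head amounts to clamping its gate logit, so the ablation check at the end directly compares the final-layer loss with and without that head; the paper's reported 5-to-23-fold effect would correspond to that gap being much larger than in a conventionally trained model.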