Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

📄 Summary
Model diffing methods aim to identify how fine-tuning alters a model's internal representations. Crosscoders achieve this by learning a shared dictionary of interpretable latent directions between the base and fine-tuned models. However, existing formulations struggle in narrow fine-tuning regimes, where behavioral changes are localized and asymmetric. Delta-Crosscoder combines BatchTopK sparsity with a delta-based loss that prioritizes directions that change between the models, while paired activations on matched inputs provide an implicit contrastive signal. Evaluated on 10 model organisms spanning synthetic false facts, emergent misalignment, subliminal learning, and taboo-word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder demonstrates its effectiveness in these narrow fine-tuning scenarios.
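The two ingredients named above can be sketched in a few lines: a BatchTopK operator that enforces a sparsity budget shared across the whole batch, and a loss that reconstructs the activation *difference* between the fine-tuned and base models from a sparse code. This is a minimal illustrative sketch, not the paper's implementation; all function names, shapes, and the single-layer encoder/decoder are assumptions.

```python
import numpy as np

def batch_topk(z, k):
    """Keep the k largest activations across the entire batch, zero the rest.

    Unlike per-sample top-k, the budget is shared across the batch, so
    latents compete globally (BatchTopK-style sparsity).
    """
    flat = z.ravel()
    if k >= flat.size:
        return z.copy()
    thresh = np.partition(flat, -k)[-k]  # value of the k-th largest entry
    return np.where(z >= thresh, z, 0.0)

def delta_crosscoder_loss(h_base, h_ft, W_enc, W_dec, k):
    """Reconstruct the activation delta between paired runs (illustrative).

    h_base, h_ft : (batch, d) activations from base/fine-tuned models on
                   matched inputs -- the pairing is the implicit contrastive
                   signal, since shared structure cancels in the delta.
    W_enc : (d, m) encoder weights; W_dec : (m, d) decoder weights.
    """
    delta = h_ft - h_base                 # directions that changed
    z = np.maximum(delta @ W_enc, 0.0)    # nonnegative pre-codes
    z = batch_topk(z, k)                  # global sparsity budget
    recon = z @ W_dec
    return np.mean((recon - delta) ** 2)  # delta-based reconstruction loss
```

Because the target is the delta rather than each model's activations separately, directions unchanged by fine-tuning contribute nothing to the loss, which is one way to read the claim that the method prioritizes directions that change between models.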