Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

📄 Summary
Model diffing methods aim to identify how fine-tuning alters a model's internal representations. Crosscoders achieve this by learning a shared dictionary of interpretable latent directions between the base and fine-tuned models. However, existing formulations struggle in narrow fine-tuning regimes, where behavioral changes are localized and asymmetric. Delta-Crosscoder combines BatchTopK sparsity with a delta-based loss that prioritizes directions that change between the models, while paired activations on matched inputs provide an implicit contrastive signal. Evaluated on 10 model organisms spanning synthetic false facts, emergent misalignment, subliminal learning, and taboo-word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder demonstrates its effectiveness in these narrow fine-tuning scenarios.
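The two ingredients named above can be sketched in a few lines: a BatchTopK operator that enforces a sparsity budget shared across the whole batch, and a loss that reconstructs the activation *difference* between the fine-tuned and base models from a sparse code. This is a minimal illustrative sketch, not the paper's implementation; all function names, shapes, and the single-layer encoder/decoder are assumptions.

```python
import numpy as np

def batch_topk(z, k):
    """Keep the k largest activations across the entire batch, zero the rest.

    Unlike per-sample top-k, the budget is shared across the batch, so
    latents compete globally (BatchTopK-style sparsity).
    """
    flat = z.ravel()
    if k >= flat.size:
        return z.copy()
    thresh = np.partition(flat, -k)[-k]  # value of the k-th largest entry
    return np.where(z >= thresh, z, 0.0)

def delta_crosscoder_loss(h_base, h_ft, W_enc, W_dec, k):
    """Reconstruct the activation delta between paired runs (illustrative).

    h_base, h_ft : (batch, d) activations from base/fine-tuned models on
                   matched inputs -- the pairing is the implicit contrastive
                   signal, since shared structure cancels in the delta.
    W_enc : (d, m) encoder weights; W_dec : (m, d) decoder weights.
    """
    delta = h_ft - h_base                 # directions that changed
    z = np.maximum(delta @ W_enc, 0.0)    # nonnegative pre-codes
    z = batch_topk(z, k)                  # global sparsity budget
    recon = z @ W_dec
    return np.mean((recon - delta) ** 2)  # delta-based reconstruction loss
```

Because the target is the delta rather than each model's activations separately, directions unchanged by fine-tuning contribute nothing to the loss, which is one way to read the claim that the method prioritizes directions that change between models.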