Detoxifying LLMs via Representation Erasure-Based Preference Optimization
📄 Summary
Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Previous defenses based on algorithms such as DPO and NPO reduce the likelihood of harmful continuations but are not robust: they remain vulnerable to adversarial prompting and can be undone by fine-tuning-based relearning attacks. These edits to the model appear to be superficial, as linear probing shows that harmful "directions" persist in the model's internal representations. To address this, Representation Erasure-based Preference Optimization (REPO) is proposed, which reformulates detoxification as a token-level preference problem. Using a novel objective together with preference data, REPO enforces erasure of harmful representation directions, thereby improving model safety.
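The summary mentions two ideas worth making concrete: a linear probe can recover a harmful "direction" from hidden states, and erasing that direction removes the linearly decodable signal. The sketch below illustrates both on synthetic vectors only; the data, the mean-difference probe, and the projection step are illustrative assumptions, not REPO's actual objective or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states: "toxic" representations are shifted along one latent
# direction (assumption: illustrative data, not the paper's setup).
d = 16
u = rng.normal(size=d)
u /= np.linalg.norm(u)
clean = rng.normal(size=(200, d))
toxic = rng.normal(size=(200, d)) + 3.0 * u

# Linear probe via the normalized mean-difference direction
# (a simple stand-in for a trained logistic-regression probe).
w = toxic.mean(0) - clean.mean(0)
w /= np.linalg.norm(w)

def probe_acc(clean_x, toxic_x, w):
    # Classify by projection onto w, thresholded at the midpoint of class means.
    thresh = ((clean_x @ w).mean() + (toxic_x @ w).mean()) / 2
    return ((clean_x @ w < thresh).mean() + (toxic_x @ w > thresh).mean()) / 2

acc_before = probe_acc(clean, toxic, w)  # high: direction is linearly decodable

# "Erasure": project every representation onto the orthogonal complement of w,
# removing the component the probe reads.
P = np.eye(d) - np.outer(w, w)
clean_e, toxic_e = clean @ P, toxic @ P

# Re-fit a probe on the erased representations; accuracy should fall toward chance.
w2 = toxic_e.mean(0) - clean_e.mean(0)
n = np.linalg.norm(w2)
if n > 1e-8:
    w2 = w2 / n
acc_after = probe_acc(clean_e, toxic_e, w2)

print(f"probe accuracy before erasure: {acc_before:.2f}")
print(f"probe accuracy after erasure:  {acc_after:.2f}")
```

Note that projecting out a single direction at inference time is a much weaker intervention than REPO's training-time objective; the sketch only shows why probe accuracy is a natural metric for whether an edit is "superficial."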
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.