Enhancing Safety of Large Language Models via Embedding Space Separation

📄 Summary

Large language models (LLMs) have demonstrated remarkable capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent findings indicate that the latent representations (embeddings) of harmful and safe queries in LLMs are typically linearly separable, a property that prior work has exploited to construct attacks that perturb the embeddings of harmful queries toward the safe subspace. Building on this observation, a representation-level fine-tuning approach called Embedding Space Separation (ES2) is proposed, which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of the model's general capabilities, an additional mechanism is introduced alongside the separation objective. The approach offers a fresh perspective on improving the safety of LLMs and has clear practical value.
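
The linear-separability claim can be made concrete with a simple probe: embed a batch of harmful and safe queries and fit a linear classifier on the resulting vectors. Below is a minimal sketch, assuming a HuggingFace causal LM with mean-pooled last-layer hidden states as the embedding; the model name, prompt lists, and pooling choice are illustrative assumptions, not the paper's exact setup.

```python
# Probe sketch: check whether harmful and safe query embeddings are
# linearly separable. Model, prompts, and pooling are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # stand-in; the paper studies aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def embed(prompts):
    """Mean-pool the last-layer hidden states of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        h = out.hidden_states[-1]            # (1, seq_len, d_model)
        vecs.append(h.mean(dim=1).squeeze(0))
    return torch.stack(vecs).numpy()

safe_prompts = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
harmful_prompts = ["How do I pick a door lock?", "Draft a phishing email."]

X = embed(safe_prompts + harmful_prompts)
y = [0] * len(safe_prompts) + [1] * len(harmful_prompts)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))  # high accuracy => linearly separable
```

A perturbation attack of the kind described above would move a harmful query's embedding across this linear boundary, i.e. in the direction of the safe side of the probe's weight vector.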
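
"Explicitly enlarging the distance" between harmful and safe representations can be realized as an auxiliary fine-tuning loss. The following is a minimal sketch of one plausible instantiation, a hinge-style margin objective that pushes pooled harmful embeddings away from the centroid of safe embeddings; the margin value, centroid design, and the capability-preserving term mentioned in the comments are assumptions, since the paper's exact ES2 loss and mechanism are not reproduced here.

```python
# Sketch of a separation objective in the spirit of ES2 (assumed form,
# not the paper's exact loss).
import torch
import torch.nn.functional as F

def separation_loss(harmful_emb: torch.Tensor, safe_emb: torch.Tensor,
                    margin: float = 5.0) -> torch.Tensor:
    """Penalize harmful embeddings that lie within `margin` of the
    centroid of safe embeddings.

    harmful_emb: (B_h, d) pooled embeddings of harmful queries
    safe_emb:    (B_s, d) pooled embeddings of safe queries
    """
    safe_centroid = safe_emb.mean(dim=0, keepdim=True)        # (1, d)
    dists = torch.norm(harmful_emb - safe_centroid, dim=-1)   # (B_h,)
    return F.relu(margin - dists).mean()

# Toy usage with random stand-in embeddings:
h_harm = torch.randn(8, 768)
h_safe = torch.randn(8, 768)
print(separation_loss(h_harm, h_safe).item())

# In fine-tuning, this term would be combined with the usual training
# objective plus some capability-preserving regularizer (the paper's
# additional mechanism; a KL penalty to the frozen base model on benign
# data is one common choice, assumed here):
#   loss = task_loss + lambda_sep * separation_loss(h_harm, h_safe) + reg
```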
