Enhancing Safety of Large Language Models via Embedding Space Separation

📄 Summary

Large language models (LLMs) have demonstrated remarkable capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent findings indicate that the latent representations (embeddings) of harmful and safe queries in LLMs are typically linearly separable, a property that prior work has exploited to construct attacks that perturb the embeddings of harmful queries toward the safe subspace. Building on this observation, a representation-level fine-tuning approach called Embedding Space Separation (ES2) is proposed, which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of the model's general capabilities, an additional mechanism is introduced alongside the separation objective. The approach offers a fresh perspective on improving the safety of LLMs and has clear practical value.
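
The linear-separability claim can be made concrete with a simple probe: embed a batch of harmful and safe queries and fit a linear classifier on the resulting vectors. Below is a minimal sketch, assuming a HuggingFace causal LM with mean-pooled last-layer hidden states as the embedding; the model name, prompt lists, and pooling choice are illustrative assumptions, not the paper's exact setup.

```python
# Probe sketch: check whether harmful and safe query embeddings are
# linearly separable. Model, prompts, and pooling are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # stand-in; the paper studies aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def embed(prompts):
    """Mean-pool the last-layer hidden states of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        h = out.hidden_states[-1]            # (1, seq_len, d_model)
        vecs.append(h.mean(dim=1).squeeze(0))
    return torch.stack(vecs).numpy()

safe_prompts = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
harmful_prompts = ["How do I pick a door lock?", "Draft a phishing email."]

X = embed(safe_prompts + harmful_prompts)
y = [0] * len(safe_prompts) + [1] * len(harmful_prompts)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))  # high accuracy => linearly separable
```

A perturbation attack of the kind described above would move a harmful query's embedding across this linear boundary, i.e. in the direction of the safe side of the probe's weight vector.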
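
"Explicitly enlarging the distance" between harmful and safe representations can be realized as an auxiliary fine-tuning loss. The following is a minimal sketch of one plausible instantiation, a hinge-style margin objective that pushes pooled harmful embeddings away from the centroid of safe embeddings; the margin value, centroid design, and the capability-preserving term mentioned in the comments are assumptions, since the paper's exact ES2 loss and mechanism are not reproduced here.

```python
# Sketch of a separation objective in the spirit of ES2 (assumed form,
# not the paper's exact loss).
import torch
import torch.nn.functional as F

def separation_loss(harmful_emb: torch.Tensor, safe_emb: torch.Tensor,
                    margin: float = 5.0) -> torch.Tensor:
    """Penalize harmful embeddings that lie within `margin` of the
    centroid of safe embeddings.

    harmful_emb: (B_h, d) pooled embeddings of harmful queries
    safe_emb:    (B_s, d) pooled embeddings of safe queries
    """
    safe_centroid = safe_emb.mean(dim=0, keepdim=True)        # (1, d)
    dists = torch.norm(harmful_emb - safe_centroid, dim=-1)   # (B_h,)
    return F.relu(margin - dists).mean()

# Toy usage with random stand-in embeddings:
h_harm = torch.randn(8, 768)
h_safe = torch.randn(8, 768)
print(separation_loss(h_harm, h_safe).item())

# In fine-tuning, this term would be combined with the usual training
# objective plus some capability-preserving regularizer (the paper's
# additional mechanism; a KL penalty to the frozen base model on benign
# data is one common choice, assumed here):
#   loss = task_loss + lambda_sep * separation_loss(h_harm, h_safe) + reg
```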
