当提示优化变成越狱：大型语言模型的自适应红队测试

出处: When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

发布: 2026年3月23日

📄 中文摘要

大型语言模型（LLMs）在高风险应用中的集成日益增加，使得可靠的安全保障成为一个重要的实际和商业关注点。现有的安全评估主要依赖于固定的有害提示集合，隐含假设对手是非自适应的，从而忽视了在输入被迭代优化以规避安全措施的现实攻击场景。本研究考察了当代语言模型对自动化对抗性提示优化的脆弱性。通过重新利用原本旨在改善良性任务表现的黑箱提示优化技术，系统性地搜索安全失效。使用DSPy，应用了三种优化器对提取的提示进行处理，旨在识别潜在的安全漏洞。

🏷️ 相关标签

#大型语言模型 #安全评估 #对抗性提示优化 #脆弱性 #黑箱技术

📄 English Summary

When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central concern in both practical and commercial contexts. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thus overlooking realistic attack scenarios where inputs are iteratively refined to evade safeguards. This research examines the vulnerability of contemporary language models to automated adversarial prompt refinement. By repurposing black-box prompt optimization techniques originally designed to enhance performance on benign tasks, a systematic search for safety failures is conducted. Utilizing DSPy, three such optimizers are applied to prompts to identify potential safety vulnerabilities.

数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等

📄 中文摘要

🏷️ 相关标签

📄 English Summary

When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

🏷️ Related Tags

📚 相关文章

AI 编程创造了新一类创作者。我就是其中之一。

人工智能成为我学习的助手

Claude CLI "泄露": 没有人赢，AI 仍然幻觉，企业仍在犯同样的错误