当提示优化变成越狱:大型语言模型的自适应红队测试

📄 中文摘要

大型语言模型(LLMs)在高风险应用中的集成日益增加,使得可靠的安全保障成为一个重要的实际和商业关注点。现有的安全评估主要依赖于固定的有害提示集合,隐含假设对手是非自适应的,从而忽视了在输入被迭代优化以规避安全措施的现实攻击场景。本研究考察了当代语言模型对自动化对抗性提示优化的脆弱性。通过重新利用原本旨在改善良性任务表现的黑箱提示优化技术,系统性地搜索安全失效。使用DSPy,应用了三种优化器对提取的提示进行处理,旨在识别潜在的安全漏洞。

📄 English Summary

When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central concern in both practical and commercial contexts. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thus overlooking realistic attack scenarios where inputs are iteratively refined to evade safeguards. This research examines the vulnerability of contemporary language models to automated adversarial prompt refinement. By repurposing black-box prompt optimization techniques originally designed to enhance performance on benign tasks, a systematic search for safety failures is conducted. Utilizing DSPy, three such optimizers are applied to prompts to identify potential safety vulnerabilities.

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等