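📄 Chinese Summary
Large language models (LLMs) face a serious challenge from jailbreak attacks in deployment. These attacks aim to evade safety alignment and induce the model to generate harmful content. Existing defenses are usually based on heuristics or on specific attack patterns and lack a deep understanding of the underlying mechanisms of jailbreak attacks. This paper is the first to systematically analyze the nature of jailbreak attacks from a causal perspective. We propose a causal graph model to characterize the relationships among the LLM, prompts, jailbreak behaviors, and harmful outputs. By identifying and intervening on key causal paths, we show that the core of a successful jailbreak attack lies in manipulating the LLM's internal states through specific prompts, thereby bypassing safety constraints. Based on this causal insight, we develop two novel strategies: Causal Jailbreak Attack (CJA) and Causal Jailbreak Defense (CJD). CJA identifies and activates the causal paths that lead to harmful outputs, significantly raising the jailbreak success rate and surpassing existing SOTA attacks. CJD blocks or modifies these key paths, effectively strengthening the model's defenses. Experimental results show that, for jailbreak attacks, our causal approach not only provides …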
📄 English Summary
A Causal Perspective for Enhancing Jailbreak Attack and Defense
Large language models (LLMs) face serious security challenges from jailbreak attacks, which aim to bypass safety alignment and induce the model to generate harmful content. Existing defenses often rely on heuristics or on specific attack patterns and lack a deep understanding of the mechanisms underlying jailbreak attacks. This paper presents the first systematic analysis of jailbreak attacks from a causal perspective. We propose a causal graph model to characterize the relationships among LLMs, prompts, jailbreak behaviors, and harmful outputs. By identifying and intervening on key causal paths, we show that the core of a successful jailbreak attack lies in manipulating the LLM's internal states through specific prompts, thereby circumventing safety constraints. Based on this causal insight, we develop two novel strategies: Causal Jailbreak Attack (CJA) and Causal Jailbreak Defense (CJD). CJA identifies and activates the causal paths that lead to harmful outputs, significantly raising jailbreak success rates and outperforming existing state-of-the-art attacks. Conversely, CJD strengthens model defenses by blocking or modifying these critical paths. Experimental results demonstrate that our causal approach not only provides insight into jailbreak mechanisms but also delivers superior attack and defense performance in practical applications. This work opens new directions for future LLM security research and emphasizes the importance of causal reasoning in building more robust and secure AI systems.
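
The idea of "intervening on a causal path" can be illustrated with a toy structural causal model. The sketch below is not the paper's method: the three-variable chain (prompt feature, suppressed safety signal, harmful output), the function name harmful_rate, and every probability in it are assumptions made up for illustration. It only shows the two kinds of intervention the summary mentions: forcing a prompt feature on (attack side) and forcing the mediating internal state off (defense side).

```python
import random

# Toy structural causal model. All variable names and probabilities are
# illustrative assumptions, not values from the paper:
#   prompt_feature -> safety_suppressed -> harmful_output


def harmful_rate(n, do_feature=None, do_suppressed=None, seed=0):
    """Estimate P(harmful_output) over n samples under optional do-interventions."""
    rng = random.Random(seed)
    harmful = 0
    for _ in range(n):
        # Draw all noise terms up front so every intervention sees the same noise.
        u_feature, u_state, u_out = rng.random(), rng.random(), rng.random()

        # Does the prompt carry a jailbreak-style feature? do(feature) overrides it.
        feature = (u_feature < 0.3) if do_feature is None else do_feature

        # Is the model's internal safety signal suppressed? Depends on the feature,
        # unless do(suppressed) overrides it.
        suppressed = u_state < (0.8 if feature else 0.1)
        if do_suppressed is not None:
            suppressed = do_suppressed

        # Is the output harmful? Depends only on the internal state.
        if u_out < (0.9 if suppressed else 0.02):
            harmful += 1
    return harmful / n


if __name__ == "__main__":
    n = 10_000
    # Attack-side view: average effect of forcing the prompt feature on vs. off.
    effect = harmful_rate(n, do_feature=True) - harmful_rate(n, do_feature=False)
    print("effect of the prompt feature:", effect)
    # Defense-side view: same prompt feature, but the mediating path is blocked.
    print("rate with path blocked:", harmful_rate(n, do_feature=True, do_suppressed=False))
```

Under these made-up probabilities, the expected harmful rate is 0.724 with the feature forced on and 0.108 with it forced off, while fixing the internal state cuts the prompt's influence entirely. Real LLM internals are of course not a three-node chain; the sketch only makes the vocabulary of paths and interventions concrete.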