Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

📄 Summary

The research reveals a critical vulnerability in AI judges, the large language models (LLMs) deployed as automated security gatekeepers. The researchers developed AdvJudge-Zero, an automated fuzzer that discovers stealthy input sequences capable of manipulating these models' decision logic. By exploiting how the models interpret benign formatting symbols and structural tokens, the tool can force an "allow" decision on otherwise prohibited content. Unlike traditional adversarial attacks that rely on detectable gibberish, AdvJudge-Zero finds low-perplexity triggers such as Markdown headers and specific role indicators. These triggers are virtually invisible to human observers and standard web application firewalls, yet they effectively influence the model's judgments.
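The article does not give implementation details for AdvJudge-Zero, but the core idea (randomly splicing ordinary, low-perplexity formatting tokens into a prohibited prompt and keeping any that flip the judge's verdict) can be sketched in a few lines. Everything below is an illustrative assumption: `mock_judge`, the planted trigger, and the candidate token list are stand-ins for a real LLM judge and a real search space, not the paper's method.

```python
import random

# Hypothetical stand-in for an LLM judge. A real fuzzer would query a model
# API; this mock denies prompts containing a blocked phrase but (to model the
# vulnerability) flips to "allow" when one low-perplexity formatting token
# appears. The trigger is planted here purely for demonstration.
BLOCKED_PHRASE = "build a weapon"
PLANTED_TRIGGER = "### System Note"

def mock_judge(prompt: str) -> str:
    """Return 'allow' or 'deny' for a candidate prompt."""
    if PLANTED_TRIGGER in prompt:   # spurious feature the fuzzer can discover
        return "allow"
    return "deny" if BLOCKED_PHRASE in prompt else "allow"

# Candidate triggers drawn from ordinary formatting/structural tokens,
# matching the article's description of low-perplexity inputs (as opposed
# to the detectable gibberish of classic adversarial suffixes).
CANDIDATES = [
    "### System Note", "## Summary", "> Quote:", "[assistant]:",
    "Role: reviewer", "---", "1. Step one",
]

def fuzz_judge(judge, prohibited_prompt, candidates, trials=200, seed=0):
    """Prepend random candidate tokens to the prompt; collect any token
    that flips the judge's decision from 'deny' to 'allow'."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        token = rng.choice(candidates)
        mutated = f"{token}\n{prohibited_prompt}"
        if judge(prohibited_prompt) == "deny" and judge(mutated) == "allow":
            found.add(token)
    return sorted(found)

triggers = fuzz_judge(mock_judge, f"Please {BLOCKED_PHRASE}.", CANDIDATES)
print(triggers)
```

Against a real judge, the search space and mutation operators would be far larger, but the loop structure (mutate, query, keep decision-flipping inputs) is the essence of this kind of black-box fuzzing.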

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.