Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

📄 Summary

The research reveals a critical vulnerability in AI judges, the large language models (LLMs) deployed as automated security gatekeepers. The researchers developed AdvJudge-Zero, an automated fuzzer that discovers stealthy input sequences capable of manipulating these models' decision logic. By exploiting how the models interpret benign formatting symbols and structural tokens, the tool can force an "allow" decision on otherwise prohibited content. Unlike traditional adversarial attacks that rely on detectable gibberish, AdvJudge-Zero finds low-perplexity triggers such as Markdown headers and specific role indicators. These triggers are virtually invisible to human observers and standard web application firewalls, yet they effectively influence the model's judgments.
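The article does not give implementation details for AdvJudge-Zero, but the core idea (randomly splicing ordinary, low-perplexity formatting tokens into a prohibited prompt and keeping any that flip the judge's verdict) can be sketched in a few lines. Everything below is an illustrative assumption: `mock_judge`, the planted trigger, and the candidate token list are stand-ins for a real LLM judge and a real search space, not the paper's method.

```python
import random

# Hypothetical stand-in for an LLM judge. A real fuzzer would query a model
# API; this mock denies prompts containing a blocked phrase but (to model the
# vulnerability) flips to "allow" when one low-perplexity formatting token
# appears. The trigger is planted here purely for demonstration.
BLOCKED_PHRASE = "build a weapon"
PLANTED_TRIGGER = "### System Note"

def mock_judge(prompt: str) -> str:
    """Return 'allow' or 'deny' for a candidate prompt."""
    if PLANTED_TRIGGER in prompt:   # spurious feature the fuzzer can discover
        return "allow"
    return "deny" if BLOCKED_PHRASE in prompt else "allow"

# Candidate triggers drawn from ordinary formatting/structural tokens,
# matching the article's description of low-perplexity inputs (as opposed
# to the detectable gibberish of classic adversarial suffixes).
CANDIDATES = [
    "### System Note", "## Summary", "> Quote:", "[assistant]:",
    "Role: reviewer", "---", "1. Step one",
]

def fuzz_judge(judge, prohibited_prompt, candidates, trials=200, seed=0):
    """Prepend random candidate tokens to the prompt; collect any token
    that flips the judge's decision from 'deny' to 'allow'."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        token = rng.choice(candidates)
        mutated = f"{token}\n{prohibited_prompt}"
        if judge(prohibited_prompt) == "deny" and judge(mutated) == "allow":
            found.add(token)
    return sorted(found)

triggers = fuzz_judge(mock_judge, f"Please {BLOCKED_PHRASE}.", CANDIDATES)
print(triggers)
```

Against a real judge, the search space and mutation operators would be far larger, but the loop structure (mutate, query, keep decision-flipping inputs) is the essence of this kind of black-box fuzzing.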

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.