Claude 阻止了所有 164 次攻击,而 GPT-4o-mini 失败率达 53%。两者的区别在于哪里?

📄 中文摘要

在对 764 次代理运行进行的跟踪中,使用了加密的金丝雀令牌来监测提示注入的情况。结果显示,不同模型的表现差异显著。Claude 模型成功阻止了所有 164 次攻击,而 GPT-4o-mini 的失败率高达 53%。这种差异可能与模型的架构、训练数据及其处理提示的能力有关。通过对比这两种模型的表现,可以更深入地理解 AI 在安全性和鲁棒性方面的挑战与机遇。

📄 English Summary

Claude Blocked All 164 Attacks. GPT-4o-mini Failed 53%. Here's the Difference.

The tracking of 764 agent runs using cryptographic canary tokens revealed significant differences in prompt injection resistance among various models. Claude successfully blocked all 164 attacks, while GPT-4o-mini exhibited a failure rate of 53%. This discrepancy may be attributed to differences in model architecture, training data, and their respective abilities to handle prompts. A comparative analysis of these models provides deeper insights into the challenges and opportunities related to AI security and robustness.

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等