GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

📄 Summary

GT-HarmBench is a novel benchmark comprising 2,009 high-stakes scenarios that focus on game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. These scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. The findings reveal that across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently resulting in harmful outcomes. Furthermore, the study measures sensitivity to the framing and ordering of game-theoretic prompts and analyzes reasoning patterns that contribute to failures.
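To make the game-theoretic structures concrete, here is a minimal sketch (not code from the paper) of the canonical 2x2 payoff matrices for the three games named above, together with a pure-strategy Nash-equilibrium check. The payoff values and the `nash_equilibria` helper are conventional textbook choices assumed for illustration, not GT-HarmBench's actual scenario payoffs.

```python
# Illustrative sketch: canonical 2x2 payoff matrices for the game structures
# GT-HarmBench focuses on, with a pure-strategy Nash-equilibrium check.
# Payoff numbers are standard textbook values, assumed for illustration only.
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
# "C" = cooperate (socially beneficial action), "D" = defect.
GAMES = {
    "prisoners_dilemma": {  # D strictly dominates C, yet (C, C) beats (D, D)
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    "stag_hunt": {  # payoff-dominant (C, C) vs. risk-dominant (D, D)
        ("C", "C"): (4, 4), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (3, 3),
    },
    "chicken": {  # mutual defection is the worst outcome for both players
        ("C", "C"): (3, 3), ("C", "D"): (2, 4),
        ("D", "C"): (4, 2), ("D", "D"): (0, 0),
    },
}

def nash_equilibria(payoffs):
    """Return the pure-strategy Nash equilibria of a 2x2 game."""
    actions = ("C", "D")
    equilibria = []
    for r, c in product(actions, repeat=2):
        # A profile is an equilibrium if neither player gains by deviating alone.
        row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in actions)
        col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in actions)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

for name, payoffs in GAMES.items():
    print(f"{name}: pure Nash equilibria = {nash_equilibria(payoffs)}")
```

Running this prints that mutual defection is the unique equilibrium of the Prisoner's Dilemma, the Stag Hunt has both an all-cooperate and an all-defect equilibrium, and Chicken's equilibria are the two asymmetric outcomes. This illustrates why purely self-interested reasoning can steer an agent away from the socially beneficial action even when mutual cooperation pays more, the failure mode the benchmark probes.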

