Prompt-Based Alignment Has a Ceiling — Evidence from a Three-Model Prisoner's Dilemma

📄 Summary

In the iterated Prisoner's Dilemma, two players simultaneously choose to "cooperate" or "defect" each round. Mutual cooperation yields a decent payoff for both players, while unilateral defection earns the defector a significantly larger reward. Over repeated rounds, players begin to infer each other's strategies. Cooperating with Tit-for-Tat is therefore not an act of kindness but the expected-value-maximizing strategy. To genuinely evaluate whether an AI is "cooperative," one must observe how it behaves when the rationally optimal play is to defect.
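The expected-value claim about Tit-for-Tat can be checked with a minimal sketch. The payoff values below (R=3, T=5, P=1, S=0, the standard Axelrod tournament matrix) are an assumption — the post does not state the matrix its experiment used:

```python
# Assumed standard Axelrod payoffs: R=3 (reward), T=5 (temptation),
# P=1 (punishment), S=0 (sucker). Keyed as (my_move, their_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate on the first round, then mirror the opponent's last move."""
    return opponent_history[-1] if opponent_history else "C"

def play_vs_tft(my_move, rounds=100):
    """Score a fixed (unconditional) strategy against Tit-for-Tat."""
    score, my_history = 0, []
    for _ in range(rounds):
        tft_move = tit_for_tat(my_history)
        score += PAYOFF[(my_move, tft_move)]
        my_history.append(my_move)
    return score

print(play_vs_tft("C"))  # always cooperate: 3 per round -> 300
print(play_vs_tft("D"))  # always defect: 5 once, then 1 per round -> 104
```

Against Tit-for-Tat over 100 rounds, unconditional cooperation earns 300 while unconditional defection earns only 104 — which is the summary's point: cooperating with Tit-for-Tat maximizes expected value, so it says nothing about whether a model is "aligned."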


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others