GPT-4 Narrowly Leads on MMLU, but the Gap Is Smaller Than You Think
📄 English Summary
GPT-4 vs Claude 3.5 vs Gemini: MMLU Zero-Shot Accuracy
OpenAI reports that GPT-4 scores 86.4% on MMLU, Anthropic's Claude 3.5 Sonnet reports 88.7%, and Google's Gemini 1.5 Pro reports 85.9%. These figures circulate widely on model cards and benchmark leaderboards, but they were not measured under the same conditions. Running a zero-shot MMLU evaluation on the same 1,000-question subset shows that the gap narrows to 2-3% once prompt format, temperature, and sampling are controlled. What remains decisive is which model answers the difficult questions correctly.
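Controlling for prompt format, temperature, and sampling amounts to running every model through one fixed evaluation loop. The following is a minimal sketch of such a harness; `query_model` is a hypothetical stand-in for each vendor's API client (not an API named in the article), and the answer-parsing rule is one reasonable choice among several:

```python
import re

# Fixed template so every model sees byte-identical prompts.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question with a single letter.\n\n"
    "{question}\n"
    "A. {a}\nB. {b}\nC. {c}\nD. {d}\n\nAnswer:"
)

def format_prompt(item):
    """Render one MMLU item with the shared template."""
    return PROMPT_TEMPLATE.format(
        question=item["question"],
        a=item["choices"][0], b=item["choices"][1],
        c=item["choices"][2], d=item["choices"][3],
    )

def parse_answer(completion):
    """Extract the first standalone A-D letter from the model's reply."""
    m = re.search(r"\b([ABCD])\b", completion)
    return m.group(1) if m else None

def evaluate(items, query_model):
    """Score one model: identical prompts, temperature 0 (greedy sampling)."""
    correct = 0
    for item in items:
        reply = query_model(format_prompt(item), temperature=0.0)
        if parse_answer(reply) == item["answer"]:
            correct += 1
    return correct / len(items)

# Smoke test with a stub "model" that always answers B.
sample = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Berlin", "Paris", "Rome", "Madrid"], "answer": "B"},
]
stub = lambda prompt, temperature: "The answer is B."
print(evaluate(sample, stub))  # both items keyed to B, so accuracy 1.0
```

Swapping in the real clients for GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro behind the same `query_model` signature is what makes the resulting scores comparable.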
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others