GPT-4 Narrowly Leads on MMLU, but the Gap Is Smaller Than You Think
📄 English Summary
GPT-4 vs Claude 3.5 vs Gemini: MMLU Zero-Shot Accuracy
OpenAI reports that GPT-4 scores 86.4% on MMLU, Anthropic's Claude 3.5 Sonnet reports 88.7%, and Google's Gemini 1.5 Pro reports 85.9%. These figures circulate widely on model cards and benchmark leaderboards, but they were not measured under the same conditions. Running a zero-shot MMLU evaluation on the same 1,000-question subset shows that the gap narrows to 2-3% once prompt format, temperature, and sampling are controlled. What remains decisive is which model answers the difficult questions correctly.
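Controlling for prompt format, temperature, and sampling amounts to running every model through one fixed evaluation loop. The following is a minimal sketch of such a harness; `query_model` is a hypothetical stand-in for each vendor's API client (not an API named in the article), and the answer-parsing rule is one reasonable choice among several:

```python
import re

# Fixed template so every model sees byte-identical prompts.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question with a single letter.\n\n"
    "{question}\n"
    "A. {a}\nB. {b}\nC. {c}\nD. {d}\n\nAnswer:"
)

def format_prompt(item):
    """Render one MMLU item with the shared template."""
    return PROMPT_TEMPLATE.format(
        question=item["question"],
        a=item["choices"][0], b=item["choices"][1],
        c=item["choices"][2], d=item["choices"][3],
    )

def parse_answer(completion):
    """Extract the first standalone A-D letter from the model's reply."""
    m = re.search(r"\b([ABCD])\b", completion)
    return m.group(1) if m else None

def evaluate(items, query_model):
    """Score one model: identical prompts, temperature 0 (greedy sampling)."""
    correct = 0
    for item in items:
        reply = query_model(format_prompt(item), temperature=0.0)
        if parse_answer(reply) == item["answer"]:
            correct += 1
    return correct / len(items)

# Smoke test with a stub "model" that always answers B.
sample = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Berlin", "Paris", "Rome", "Madrid"], "answer": "B"},
]
stub = lambda prompt, temperature: "The answer is B."
print(evaluate(sample, stub))  # both items keyed to B, so accuracy 1.0
```

Swapping in the real clients for GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro behind the same `query_model` signature is what makes the resulting scores comparable.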
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others