Why we no longer evaluate SWE-bench Verified
SWE-bench Verified is contaminated and mismeasures frontier coding progress; OpenAI now recommends SWE-bench Pro instead. Flawed tests reward shortcut solutions, and training-data leakage inflates scores, so Verified numbers can keep rising even when models have not genuinely improved on real tasks. Consequently, do not rely on Verified alone for model selection or product claims: run SWE-bench Pro or a private holdout set, and treat reported Verified scores with skepticism. A quick checklist for sane benchmarking includes time-split evaluations and duplicate/overlap scans of benchmark items against training data.
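The duplicate/overlap scan from the checklist can be sketched as a character n-gram comparison between each benchmark item and the training corpus, a common contamination heuristic. This is a minimal illustration, not the tooling behind SWE-bench Pro; the function names, the 13-gram size, and the 0.5 threshold are all assumptions chosen for the example.

```python
# Hypothetical sketch of a duplicate/overlap scan: flag benchmark items
# whose character n-grams overlap heavily with a training corpus.
# Names, n-gram size, and threshold are illustrative assumptions.

def char_ngrams(text: str, n: int = 13) -> set:
    """Set of character n-grams after whitespace/case normalization."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(item: str, corpus_ngrams: set, n: int = 13) -> float:
    """Fraction of the item's n-grams that also occur in the corpus."""
    grams = char_ngrams(item, n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)

def scan_overlap(benchmark_items, training_docs, n=13, threshold=0.5):
    """Return benchmark items suspected of leaking into training data."""
    corpus_ngrams = set()
    for doc in training_docs:
        corpus_ngrams |= char_ngrams(doc, n)
    return [item for item in benchmark_items
            if overlap_ratio(item, corpus_ngrams, n) >= threshold]

# Toy example: the second benchmark item is copied verbatim from the
# "training" corpus and should be flagged; the first should pass.
train = ["def add(a, b):\n    return a + b  # utility from repo X"]
bench = ["def mul(a, b): return a * b",
         "def add(a, b):\n    return a + b  # utility from repo X"]
flagged = scan_overlap(bench, train, n=13, threshold=0.5)
print(flagged)
```

A time-split evaluation complements this: instead of scanning text, it simply restricts scoring to tasks created after the model's training cutoff, so leakage is impossible by construction.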
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others