📄 Chinese Summary
SWE-bench is a benchmark favored by labs and is commonly cited in model releases. Although the official leaderboard is updated infrequently, a recent comprehensive run against the current generation of models is worth noting because these benchmark results are not self-reported by the labs. The new results are for the 'Bash Only' benchmark, which runs the mini-swe-bench agent (approximately 9,000 lines of Python code) and publishes the prompts used. These results offer an important basis for evaluating the performance of current models.
📄 English Summary
SWE-bench February 2025 leaderboard update
SWE-bench is a highly regarded benchmark, frequently cited by labs in their model releases. The official leaderboard is rarely updated, so this recent comprehensive run against the current generation of models is noteworthy, especially because the results are not self-reported by the labs. The fresh numbers are for the 'Bash Only' benchmark, which runs the mini-swe-bench agent (approximately 9,000 lines of Python code) and publishes the prompts used for the evaluation. These results offer a useful independent check on the performance of current models.
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others