📄 Chinese Summary
SWE-bench is a benchmark favored by labs and is commonly cited in model releases. Although the official leaderboard is updated infrequently, a recent comprehensive run against the current generation of models is worth noting because these benchmark results are not self-reported by the labs. The new results are for the 'Bash Only' benchmark, which runs the mini-swe-bench agent (approximately 9,000 lines of Python code) and publishes the prompts used. These results offer an important basis for evaluating the performance of current models.
📄 English Summary
SWE-bench February 2025 leaderboard update
SWE-bench is a highly regarded benchmark, frequently cited by labs in their model releases. The official leaderboard is rarely updated, so this recent comprehensive run against the current generation of models is noteworthy, especially because the results are not self-reported by the labs. The fresh numbers are for the 'Bash Only' benchmark, which runs the mini-swe-bench agent (approximately 9,000 lines of Python code) and publishes the prompts used for the evaluation. These results offer a useful independent check on the performance of current models.
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others