Achieving an 83.4% Fix Rate on SWE-bench Verified with Runtime Facts

📄 English Summary

Achieving an 83.4% Fix Rate on SWE-bench Verified with Runtime Facts

A recent SWE-bench Verified run tested a new AI debugging paradigm: systematic debugging grounded in Runtime Facts. A dynamic tracing mechanism was added to the Live-SWE-agent architecture to give the model runtime context. With the Google Gemini 3 Pro model, this produced a theoretical combined fix rate of 83.4%, reported as the highest known result on the SWE-bench Verified evaluation to date. The same model on the original Live-SWE-agent scored a 77.4% baseline, a gap of 6.0 percentage points. The gain comes from complex bugs the baseline could not solve, which were fixed by using Runtime Facts as the basis for the agent's decisions.
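The summary does not say how the "theoretical combined" rate is computed. One plausible reading, assumed here and not confirmed by the source, is that an instance counts as fixed if either the baseline run or the runtime-facts run resolved it. The sketch below shows that arithmetic. SWE-bench Verified contains 500 task instances, so 83.4% and 77.4% correspond to 417 and 387 resolved instances. The function names and the instance IDs in the toy example are hypothetical.

```python
from typing import AbstractSet

# SWE-bench Verified contains 500 human-validated task instances.
TOTAL_INSTANCES = 500


def fix_rate(resolved: AbstractSet[str], total: int = TOTAL_INSTANCES) -> float:
    """Percentage of benchmark instances in `resolved`."""
    return 100.0 * len(resolved) / total


def combined_fix_rate(
    baseline: AbstractSet[str],
    with_runtime_facts: AbstractSet[str],
    total: int = TOTAL_INSTANCES,
) -> float:
    """ASSUMPTION: 'combined' means resolved by at least one of the two runs."""
    return fix_rate(baseline | with_runtime_facts, total)


def instances_for_rate(rate_percent: float, total: int = TOTAL_INSTANCES) -> int:
    """Number of resolved instances implied by a reported percentage."""
    return round(rate_percent * total / 100)


if __name__ == "__main__":
    # Instance counts implied by the reported percentages.
    print(instances_for_rate(83.4))  # 417
    print(instances_for_rate(77.4))  # 387

    # Toy example with hypothetical instance IDs on a 5-instance benchmark.
    baseline = {"task-1", "task-2", "task-3"}
    traced = {"task-2", "task-3", "task-4"}
    print(combined_fix_rate(baseline, traced, total=5))  # 80.0
```

Under this reading, the 6.0-point gap corresponds to 30 additional instances out of 500. This also explains the word "theoretical": a union over two runs is an upper bound that no single run necessarily reaches.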
