Agent Harness Engineering: Lessons from Eight Months in Production

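📄 Summary (translated from Chinese)

In production, Opus 4.5 failed to build a production-grade web application. The problem was not the model itself but that it tried to complete every task in one pass, leaving half-finished features in the context window and declaring success too early. After the harness was fixed and progress tracking and a step-by-step workflow were added, applications built with the same model began to ship smoothly. An effective agent harness needs a progressive disclosure mechanism: show the model only the information it needs, and only when it needs it. With this approach, the model improved by 36 points on the CORE-Bench benchmark.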

📄 English Summary

Agent Harness Engineering: What 8 Months in Production Taught Me

Opus 4.5 failed to build a production web app not because the model was inadequate, but because it tried to do everything in one go, leaving half-implemented features scattered across context windows and declaring victory prematurely. Once the scaffolding was fixed, with progress tracking and incremental workflows added, the same model began to deliver successfully. The key to effective harnesses for long-running agents is progressive disclosure: showing the model only what it needs, when it needs it. This approach produced a 36-point increase on the CORE-Bench benchmark.
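
The summary names three harness ideas: progress tracking, incremental workflows, and progressive disclosure. The following is a minimal sketch of how they might fit together, not the author's actual harness. The file layout, the `Feature` and `Progress` types, and the `work_on` callback (standing in for one agent session that returns True only when the feature is verified) are all hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class Feature:
    name: str
    done: bool = False


@dataclass
class Progress:
    """Progress log persisted on disk so it survives across context windows."""
    features: List[Feature] = field(default_factory=list)

    @classmethod
    def load(cls, path: Path) -> "Progress":
        if not path.exists():
            return cls()
        raw = json.loads(path.read_text(encoding="utf-8"))
        return cls([Feature(**f) for f in raw["features"]])

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    def next_feature(self) -> Optional[Feature]:
        # First unfinished feature, or None when everything is done.
        return next((f for f in self.features if not f.done), None)


def build_prompt(progress: Progress) -> Optional[str]:
    """Progressive disclosure: reveal only the next feature plus a short status."""
    feature = progress.next_feature()
    if feature is None:
        return None
    done = sum(f.done for f in progress.features)
    return (
        f"{done}/{len(progress.features)} features complete.\n"
        f"Implement ONLY this feature, then stop: {feature.name}"
    )


def run_session(path: Path, work_on: Callable[[str], bool]) -> bool:
    """Run one incremental session; return False when no work remains.

    `work_on` is a hypothetical stand-in for the agent call. A feature is
    marked done only if it returns True, so the harness, not the model,
    decides when to declare success.
    """
    progress = Progress.load(path)
    feature = progress.next_feature()
    if feature is None:
        return False
    feature.done = bool(work_on(build_prompt(progress)))
    progress.save(path)
    return True
```

Under these assumptions, each call to `run_session` handles exactly one feature and writes the result back to disk, so a fresh context window can resume from the log instead of from half-finished work.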
