Agent Harness Engineering: Lessons from Eight Months in Production

Source: Agent Harness Engineering: What 8 Months in Production Taught Me

Published: March 6, 2026

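📄 Chinese Summary (translated)

Opus 4.5 failed to build a production-grade web application. The problem was not the model itself: it tried to complete every task at once, left half-finished features across context windows, and declared success too early. After the scaffolding was fixed and progress tracking and an incremental workflow were added, applications built with the same model began to ship smoothly. An effective agent harness needs progressive disclosure: show the model only the information it needs, only when it needs it. With this approach, the model improved by 36 points on the CORE-Bench benchmark.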

🏷️ Related Tags

#AgentHarness #Production #ProgressiveDisclosure #ModelPerformance #Benchmarking

📄 English Summary

Agent Harness Engineering: What 8 Months in Production Taught Me

Opus 4.5 failed to build a production web app not due to the model's inadequacy, but because it attempted to accomplish everything in one go, leaving half-implemented features across context windows and declaring victory prematurely. By fixing the scaffolding, adding progress tracking, and implementing incremental workflows, the same model began to deliver successfully. The key to effective harnesses for long-running agents lies in progressive disclosure: showing the model only what it needs when it needs it. This approach resulted in a 36-point increase on the CORE-Bench benchmark.
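
The progress tracking and progressive disclosure described above can be illustrated with a minimal sketch. This is not the author's harness: `Feature`, `ProgressTracker`, and the prompt wording are hypothetical names chosen for illustration, and the sketch assumes the work can be split into a flat list of features that are completed one at a time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """One unit of work (hypothetical structure, not from the source)."""
    name: str
    spec: str          # full details, disclosed only while this feature is active
    done: bool = False


@dataclass
class ProgressTracker:
    """Tracks which features are finished and builds a narrow prompt."""
    features: List[Feature] = field(default_factory=list)

    def next_feature(self) -> Optional[Feature]:
        # First unfinished feature, or None when everything is done.
        return next((f for f in self.features if not f.done), None)

    def mark_done(self, name: str) -> None:
        for f in self.features:
            if f.name == name:
                f.done = True
                return
        raise KeyError(name)

    def all_done(self) -> bool:
        # The harness, not the model, decides when the job is finished.
        return all(f.done for f in self.features)

    def build_prompt(self) -> str:
        # One-line status for every feature, full spec for the next one only.
        current = self.next_feature()
        lines = ["Progress:"]
        for f in self.features:
            lines.append(f"- [{'x' if f.done else ' '}] {f.name}")
        if current is None:
            lines.append("All features are complete.")
        else:
            lines.append(f"Work on exactly one feature now: {current.name}")
            lines.append(f"Spec: {current.spec}")
        return "\n".join(lines)
```

Each new context window would start from `build_prompt()`, so the model sees the overall status but only the specification of the single feature it should work on next. Completion is read from the tracker rather than from the model's own claim, which is one way to guard against premature declarations of success.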

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others
