The $12K Weekend: What Nobody Tells You About LLM Agents in Production
An autonomous agent ran unattended over a weekend, and by Monday it had made 47,000 API calls. Nothing capped its budget, and nothing enforced a retry limit. The agent hit a transient API error, retried, hit another error, retried again, and kept going for 60 hours because no mechanism told it to stop. This is not an isolated incident: Simon Willison has documented the pattern, and a thread about it on r/MachineLearning drew 800 upvotes. The reported bills vary ($3K, $8K, $12K), but the pattern is always the same: retry loops, no ceilings, nobody watching. The first attempted fix was better observability, in the form of cost alerts and dashboards, but that did not address the root cause.
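The two guards the summary says were missing, an enforced retry limit and a budget ceiling, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fix the article goes on to propose: `RunBudget`, `call_with_limits`, and `TransientAPIError` are hypothetical names, the flat per-call cost is a simplification, and failed attempts are assumed to be billable.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the run reaches its call or spend ceiling."""


class TransientAPIError(Exception):
    """Hypothetical stand-in for whatever retryable error the real client raises."""


@dataclass
class RunBudget:
    max_calls: int        # hard ceiling on API calls for the whole run
    max_cost_usd: float   # hard ceiling on spend for the whole run
    calls: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before counting, so the ceiling is never exceeded.
        over_calls = self.calls + 1 > self.max_calls
        over_cost = self.cost_usd + cost_usd > self.max_cost_usd
        if over_calls or over_cost:
            raise BudgetExceeded(
                f"stopped after {self.calls} calls, ${self.cost_usd:.2f} spent"
            )
        self.calls += 1
        self.cost_usd += cost_usd


def call_with_limits(fn, budget, cost_per_call_usd,
                     max_retries=3, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(), retrying transient errors at most max_retries times.

    Every attempt, including failed ones, is charged against the shared
    budget (an assumption: failed calls are treated as billable).
    """
    for attempt in range(max_retries + 1):
        budget.charge(cost_per_call_usd)  # raises BudgetExceeded at the ceiling
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_retries:
                raise  # retry limit reached: surface the error instead of looping
            sleep(base_delay_s * 2 ** attempt)  # exponential backoff
```

The budget object is shared across every call in a run, so even if each individual call stays within its retry limit, the run as a whole still stops at the ceiling. An alert reports the spend after the fact; a raised `BudgetExceeded` ends the loop.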