📄 English Summary
Why Agent Testing is Broken
Software testing has been effectively solved for decades: developers write functions, assert on their outputs, and ship once CI turns green. The advent of large language model (LLM) agents breaks this contract, and many teams have yet to realize the implications. Ask an agent to summarize a contract, and the response can vary significantly from one day to the next after a model update, a prompt adjustment, or a change to the context window. These differences are not necessarily incorrect, but they can cause downstream systems to fail silently at critical moments. This is already happening in production environments across a range of companies.
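The failure mode above can be made concrete with a minimal sketch. Here `summarize_contract` is a hypothetical stand-in for an agent call (it simulates run-to-run variation deterministically, so the example is self-contained); the point is that an exact-match assertion breaks silently on a reworded output, while a check on the underlying facts survives it:

```python
def summarize_contract(text: str, model_version: int) -> str:
    """Hypothetical agent call: same input, different phrasing per model version."""
    if model_version == 1:
        return "Term: 12 months. Penalty: 5% for late payment."
    return "The contract runs for 12 months; late payment incurs a 5% penalty."

def exact_match_test(output: str) -> bool:
    # Classic CI-style assertion: brittle against any rewording.
    return output == "Term: 12 months. Penalty: 5% for late payment."

def property_test(output: str) -> bool:
    # Tolerant check: assert the extracted facts, not the exact phrasing.
    return "12 months" in output and "5%" in output

v1 = summarize_contract("...", model_version=1)
v2 = summarize_contract("...", model_version=2)

print(exact_match_test(v1), exact_match_test(v2))  # True False: breaks after the "update"
print(property_test(v1), property_test(v2))        # True True: survives the rewording
```

The exact-match test passes on version 1 and fails on version 2 even though both summaries are correct, which is precisely the silent downstream breakage described above; asserting on properties of the output is one common mitigation, not a complete fix.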