📄 English Summary
Why Agent Testing is Broken
Software testing has been effectively solved for decades: developers write functions, assert on their outputs, and ship once CI turns green. The advent of large language model (LLM) agents breaks this contract, and many teams have yet to realize the implications. Ask an agent to summarize a contract, and the response can vary significantly from one day to the next after a model update, a prompt adjustment, or a change to the context window. These differences are not necessarily incorrect, but they can cause downstream systems to fail silently at critical moments. This is already happening in production environments across a range of companies.
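The failure mode above can be made concrete with a minimal sketch. Here `summarize_contract` is a hypothetical stand-in for an agent call (it simulates run-to-run variation deterministically, so the example is self-contained); the point is that an exact-match assertion breaks silently on a reworded output, while a check on the underlying facts survives it:

```python
def summarize_contract(text: str, model_version: int) -> str:
    """Hypothetical agent call: same input, different phrasing per model version."""
    if model_version == 1:
        return "Term: 12 months. Penalty: 5% for late payment."
    return "The contract runs for 12 months; late payment incurs a 5% penalty."

def exact_match_test(output: str) -> bool:
    # Classic CI-style assertion: brittle against any rewording.
    return output == "Term: 12 months. Penalty: 5% for late payment."

def property_test(output: str) -> bool:
    # Tolerant check: assert the extracted facts, not the exact phrasing.
    return "12 months" in output and "5%" in output

v1 = summarize_contract("...", model_version=1)
v2 = summarize_contract("...", model_version=2)

print(exact_match_test(v1), exact_match_test(v2))  # True False: breaks after the "update"
print(property_test(v1), property_test(v2))        # True True: survives the rewording
```

The exact-match test passes on version 1 and fails on version 2 even though both summaries are correct, which is precisely the silent downstream breakage described above; asserting on properties of the output is one common mitigation, not a complete fix.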