📄 English Summary
LLM Evals on Real Traffic — Not Just Test Suites
Many teams recognize the need to evaluate their LLM outputs, but few actually do so in production. The typical setup is a test suite with a handful of golden examples, run in CI before each deployment, in the hope that those examples represent real user input. In practice, the prompts users write in production are messier, longer, and weirder than anything in the test fixtures, and the critical edge cases are usually the ones nobody thought of. Meanwhile, the actual requests and responses flowing through the AI pipeline every day sit in logs that nobody examines until something breaks. Evaluations should run where the data already exists.
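The idea above can be sketched in a few lines: instead of grading a fixed fixture set in CI, sample logged production request/response pairs and run cheap automated checks over them. This is a minimal illustration, not a description of any specific tool from the article; the log records, field names, and check heuristics here are all hypothetical.

```python
import random

# Hypothetical log records. In a real pipeline these would be pulled from
# your request/response logs; the field names are illustrative only.
production_logs = [
    {"prompt": "Summarize this 40-page contract ...", "response": "The contract covers ..."},
    {"prompt": "translate to french: hello", "response": "bonjour"},
    {"prompt": "Write SQL for our top customers", "response": "I'm sorry, I can't help with that."},
]

def check_response(record):
    """Cheap automated checks; a real setup might add an LLM-as-judge pass."""
    resp = record["response"]
    checks = {
        "non_empty": bool(resp.strip()),
        "no_refusal": "I'm sorry" not in resp and "I can't" not in resp,
    }
    return all(checks.values()), checks

def sample_and_eval(logs, k=50, seed=0):
    """Sample up to k logged interactions and return the pass rate."""
    rng = random.Random(seed)  # seeded for a reproducible sample
    sample = rng.sample(logs, min(k, len(logs)))
    results = [check_response(r) for r in sample]
    return sum(ok for ok, _ in results) / len(results)

print(sample_and_eval(production_logs))
```

Run on a schedule (or on a fraction of live traffic), this turns the logs that already exist into a continuously refreshed eval set, with the pass rate tracked over time instead of only at deploy time.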
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others