📄 English Summary
LLM Evals on Real Traffic — Not Just Test Suites
Many teams recognize the need to evaluate their LLM outputs, but few actually do so in production. The typical setup is a test suite with a handful of golden examples, run in CI before each deployment, in the hope that those examples represent real user input. In practice, the prompts users write in production are messier, longer, and weirder than anything in the test fixtures, and the critical edge cases are usually the ones nobody thought of. Meanwhile, the actual requests and responses flowing through the AI pipeline every day sit in logs that nobody examines until something breaks. Evaluations should run where the data already exists.
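The idea above can be sketched in a few lines: instead of grading a fixed fixture set in CI, sample logged production request/response pairs and run cheap automated checks over them. This is a minimal illustration, not a description of any specific tool from the article; the log records, field names, and check heuristics here are all hypothetical.

```python
import random

# Hypothetical log records. In a real pipeline these would be pulled from
# your request/response logs; the field names are illustrative only.
production_logs = [
    {"prompt": "Summarize this 40-page contract ...", "response": "The contract covers ..."},
    {"prompt": "translate to french: hello", "response": "bonjour"},
    {"prompt": "Write SQL for our top customers", "response": "I'm sorry, I can't help with that."},
]

def check_response(record):
    """Cheap automated checks; a real setup might add an LLM-as-judge pass."""
    resp = record["response"]
    checks = {
        "non_empty": bool(resp.strip()),
        "no_refusal": "I'm sorry" not in resp and "I can't" not in resp,
    }
    return all(checks.values()), checks

def sample_and_eval(logs, k=50, seed=0):
    """Sample up to k logged interactions and return the pass rate."""
    rng = random.Random(seed)  # seeded for a reproducible sample
    sample = rng.sample(logs, min(k, len(logs)))
    results = [check_response(r) for r in sample]
    return sum(ok for ok, _ in results) / len(results)

print(sample_and_eval(production_logs))
```

Run on a schedule (or on a fraction of live traffic), this turns the logs that already exist into a continuously refreshed eval set, with the pass rate tracked over time instead of only at deploy time.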
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others