How to Test LLM Performance on Real Code Instead of Synthetic Benchmarks

📄 Chinese Summary (translated)

Large Language Models (LLMs) perform impressively on synthetic benchmarks, scoring 87% on HumanEval, for example. Yet when applied to real codebases, accuracy drops to roughly 30% because of cross-file dependencies, internal frameworks, and legacy patterns. Synthetic benchmarks typically test isolated functions with clean inputs and outputs, whereas real software engineering environments are far more complex. To address this, this guide presents a method for building evaluation datasets from your own code, highlights the metrics that truly matter for production use cases, and shows how to integrate LLM testing into a CI/CD pipeline so performance problems are caught before they affect your team.

📄 English Summary

How to Test LLM Performance on Real Code Instead of Synthetic Benchmarks

Large Language Models (LLMs) often achieve impressive scores, such as 87% on HumanEval, but their performance can drop to around 30% when applied to actual codebases due to complexities like cross-file dependencies, internal frameworks, and legacy patterns. Synthetic benchmarks typically test isolated functions with clean inputs and outputs, which do not reflect the realities of software engineering. This guide outlines how to create evaluation datasets from your own code, identifies the metrics that truly matter for production use cases, and explains how to integrate LLM testing into your CI/CD pipeline to catch performance issues before they impact your team.
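The core loop described above can be sketched as a minimal evaluation harness: extract task/assertion pairs from your own codebase, have the model generate a solution for each, and report a pass rate that CI can gate on. This is an illustrative sketch, not the guide's actual implementation; `generate` is a hypothetical stand-in for whatever LLM client you use, and the `EvalCase` structure is an assumption.

```python
# Minimal sketch of an LLM eval harness over your own code.
# Assumption: `generate` is any callable (prompt -> code string), e.g. an LLM API client.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str     # task description, ideally extracted from your real codebase
    test_code: str  # assertions the generated code must satisfy

def run_case(generated: str, case: EvalCase) -> bool:
    """Execute generated code plus the case's assertions in a scratch namespace."""
    ns: dict = {}
    try:
        exec(generated, ns)       # define the generated function(s)
        exec(case.test_code, ns)  # run the assertions against them
        return True
    except Exception:
        return False

def run_eval(generate, cases) -> float:
    """Return the model's pass rate over the dataset (0.0 to 1.0)."""
    passed = sum(run_case(generate(c.prompt), c) for c in cases)
    return passed / len(cases)

# Usage with a stub model; in real use, call your LLM provider here.
cases = [
    EvalCase(
        prompt="Write a function add(a, b) that returns a + b.",
        test_code="assert add(2, 3) == 5",
    ),
]
stub = lambda prompt: "def add(a, b):\n    return a + b"
print(run_eval(stub, cases))  # → 1.0
```

In a CI/CD job, the same script would fail the build when the pass rate drops below a chosen threshold (e.g. `sys.exit(1) if rate < 0.8`), which is how performance regressions get caught before they reach the team. Note that `exec` on model output is only safe in a sandboxed CI runner.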

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others