How to Test LLM Performance on Real Code Instead of Synthetic Benchmarks

📄 Chinese Summary (translated)

Large Language Models (LLMs) perform impressively on synthetic benchmarks, scoring 87% on HumanEval, for example. Yet when applied to real codebases, accuracy drops to roughly 30% because of cross-file dependencies, internal frameworks, and legacy patterns. Synthetic benchmarks typically test isolated functions with clean inputs and outputs, whereas real software engineering environments are far more complex. To address this, this guide presents a method for building evaluation datasets from your own code, highlights the metrics that truly matter for production use cases, and shows how to integrate LLM testing into a CI/CD pipeline so performance problems are caught before they affect your team.

📄 English Summary

How to Test LLM Performance on Real Code Instead of Synthetic Benchmarks

Large Language Models (LLMs) often achieve impressive scores, such as 87% on HumanEval, but their performance can drop to around 30% when applied to actual codebases due to complexities like cross-file dependencies, internal frameworks, and legacy patterns. Synthetic benchmarks typically test isolated functions with clean inputs and outputs, which do not reflect the realities of software engineering. This guide outlines how to create evaluation datasets from your own code, identifies the metrics that truly matter for production use cases, and explains how to integrate LLM testing into your CI/CD pipeline to catch performance issues before they impact your team.
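The core loop described above can be sketched as a minimal evaluation harness: extract task/assertion pairs from your own codebase, have the model generate a solution for each, and report a pass rate that CI can gate on. This is an illustrative sketch, not the guide's actual implementation; `generate` is a hypothetical stand-in for whatever LLM client you use, and the `EvalCase` structure is an assumption.

```python
# Minimal sketch of an LLM eval harness over your own code.
# Assumption: `generate` is any callable (prompt -> code string), e.g. an LLM API client.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str     # task description, ideally extracted from your real codebase
    test_code: str  # assertions the generated code must satisfy

def run_case(generated: str, case: EvalCase) -> bool:
    """Execute generated code plus the case's assertions in a scratch namespace."""
    ns: dict = {}
    try:
        exec(generated, ns)       # define the generated function(s)
        exec(case.test_code, ns)  # run the assertions against them
        return True
    except Exception:
        return False

def run_eval(generate, cases) -> float:
    """Return the model's pass rate over the dataset (0.0 to 1.0)."""
    passed = sum(run_case(generate(c.prompt), c) for c in cases)
    return passed / len(cases)

# Usage with a stub model; in real use, call your LLM provider here.
cases = [
    EvalCase(
        prompt="Write a function add(a, b) that returns a + b.",
        test_code="assert add(2, 3) == 5",
    ),
]
stub = lambda prompt: "def add(a, b):\n    return a + b"
print(run_eval(stub, cases))  # → 1.0
```

In a CI/CD job, the same script would fail the build when the pass rate drops below a chosen threshold (e.g. `sys.exit(1) if rate < 0.8`), which is how performance regressions get caught before they reach the team. Note that `exec` on model output is only safe in a sandboxed CI runner.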

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others