5 Models, 467 Actions, 1 Winner — What We Learned Comparing LLMs on Real Code Generation
📄 Summary
Five AI models were run on the same task across 467 runs, each run producing a complete deployable website: HTML, CSS, JavaScript, and assets. The central question: can cheaper models match Claude Sonnet on production code generation? The headline finding is that they cannot, but the deeper analysis surfaces more interesting results. The models tested span a 15× cost range, from Claude Sonnet 4.6, treated as the gold standard, down to Claude Haiku 4.5. The study offers a grounded view of how these models perform in real-world code generation.
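The benchmark loop described above can be sketched as follows. This is a minimal illustration, not the article's actual harness: `generate_site` is a stub standing in for a real model API call, and the model names, task string, and pass criterion (all core site files present) are assumptions for the sake of the example.

```python
# Hypothetical sketch of the benchmark described above: run each model
# on the same task N times and count runs that yield a complete site.
# `generate_site` is a stub; a real harness would call the model's API.

REQUIRED_FILES = {"index.html", "style.css", "app.js"}  # illustrative

def generate_site(model: str, task: str) -> dict:
    """Stub: return {filename: content} as a real model call would."""
    return {name: f"/* {model} output for: {task} */" for name in REQUIRED_FILES}

def run_benchmark(models: list, task: str, runs_per_model: int) -> dict:
    """Count, per model, how many runs produced every required file."""
    results = {m: 0 for m in models}
    for model in models:
        for _ in range(runs_per_model):
            site = generate_site(model, task)
            # A run "passes" only if every deployable asset is present.
            if REQUIRED_FILES <= set(site):
                results[model] += 1
    return results

if __name__ == "__main__":
    models = ["claude-sonnet-4.6", "claude-haiku-4.5"]  # illustrative names
    print(run_benchmark(models, "build a landing page", runs_per_model=3))
```

A real version would also score correctness and style of the generated code, not just file completeness, and would record per-run token cost to reproduce the 15× cost comparison.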