LLM Evaluation & Benchmarking MCP Servers — promptfoo, DeepEval, MCP-Bench, Red-Teaming


📄 English Summary


The LLM evaluation tool ecosystem has matured significantly, with contributions from Accenture, Salesforce, and Alibaba/ModelScope covering the entire evaluation lifecycle: unit testing, benchmarking, red-teaming, and LLM-as-a-judge workflows. A notable finding is that even GPT-5 scores only 43.72% on real-world MCP (Model Context Protocol) tasks. promptfoo, a widely adopted CLI and library (300K+ developers, 127 Fortune 500 companies), compares outputs from GPT, Claude, Gemini, and Llama through declarative YAML configuration, and its red-teaming module scans for more than 50 vulnerability types.
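As a concrete illustration of the declarative YAML approach, a minimal promptfoo configuration comparing two providers might look like the following sketch. The model identifiers, sample prompt, and test case are illustrative assumptions, not from the source; check provider IDs against the promptfoo documentation:

```yaml
# promptfooconfig.yaml — minimal sketch; model IDs are illustrative examples
prompts:
  - "Answer concisely: {{question}}"

providers:
  - openai:gpt-4o-mini                      # GPT family
  - anthropic:messages:claude-3-5-sonnet-20241022  # Claude family

tests:
  - vars:
      question: "What does MCP stand for in LLM tooling?"
    assert:
      - type: icontains        # case-insensitive substring assertion
        value: "Model Context Protocol"
```

Running `npx promptfoo eval` evaluates every prompt × provider × test combination and `npx promptfoo view` renders the side-by-side comparison matrix; recent versions also provide `promptfoo redteam init` to scaffold a red-team scan configuration.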


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.