Build an eval harness for 184 AI agent prompts with promptfoo

📄 Summary

Agency-agents is an open-source collection of 184 specialist AI agent prompts, covering roles such as backend architect, UX designer, historian, and game developer. Each prompt is a detailed markdown file that defines the agent's identity, workflows, deliverable templates, and success metrics. Until now, however, there has been no effective way to assess the quality of the output these prompts produce. A promptfoo-based eval harness addresses this by scoring outputs automatically with an LLM as the judge, and its initial run has already surfaced real quality gaps.
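As a rough illustration of the approach, the sketch below builds a promptfoo-style eval configuration for a single agent prompt, with an `llm-rubric` assertion acting as the LLM judge. It is not the project's actual harness: the `AgentPrompt` shape, the `buildAgentEval` helper, the rubric wording, the sample tasks, and the provider ID are all assumptions made for illustration. Only the configuration keys (`description`, `prompts`, `providers`, `tests`, `vars`, `assert`) and the `llm-rubric` assertion type are taken from promptfoo's configuration format.

```typescript
// Hypothetical input shape: one entry per agent markdown file.
interface AgentPrompt {
  name: string;     // e.g. "backend-architect" (illustrative)
  markdown: string; // full text of the agent's markdown prompt
}

// Minimal subset of a promptfoo config, typed locally so the sketch is self-contained.
interface EvalConfig {
  description: string;
  prompts: string[];
  providers: string[];
  tests: {
    vars: Record<string, string>;
    assert: { type: string; value: string }[];
  }[];
}

// Build one eval config per agent: the agent prompt is followed by a templated
// task, and every test case is graded by an LLM judge against the same rubric.
function buildAgentEval(agent: AgentPrompt, tasks: string[], provider: string): EvalConfig {
  // Assumed rubric wording; a real harness would likely tune this per agent.
  const rubric =
    "The response follows the agent's stated workflow, uses its deliverable " +
    "template, and is specific enough to act on.";

  return {
    description: `Eval for ${agent.name}`,
    // {{task}} is filled in from each test's vars when promptfoo renders the prompt.
    prompts: [`${agent.markdown}\n\nTask: {{task}}`],
    providers: [provider],
    tests: tasks.map((task) => ({
      vars: { task },
      assert: [{ type: "llm-rubric", value: rubric }],
    })),
  };
}

// Illustrative usage with made-up agent content and tasks.
const config = buildAgentEval(
  { name: "backend-architect", markdown: "You are a backend architect." },
  ["Design a rate limiter for a public API.", "Plan a zero-downtime database migration."],
  "openai:gpt-4o-mini", // assumed provider ID; substitute whichever model is available
);

console.log(JSON.stringify(config, null, 2));
```

The printed object could be saved as a config file and run with the promptfoo CLI. Generating one config per agent keeps each agent's tasks from being run against the other 183 prompts. Agent files that themselves contain `{{ ... }}` sequences would need escaping, since promptfoo treats them as template variables.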
