📄 English Summary
Evaluate LLM code generation with LLM-as-judge evaluators
This tutorial demonstrates how to score every code generation response against custom criteria, helping teams choose the AI model that best fits their security requirements, API conventions, and known blind spots. By setting up custom judges that check for the vulnerabilities the team actually cares about, validate against its real API conventions, and flag the scope-creep patterns it runs into most often, a team can accumulate a few weeks of data and then pick the right model for each task based on evidence. The tutorial walks through building a proxy server that routes Claude Code requests through LaunchDarkly, forwarding them to any model: Anthropic, OpenAI, Mistral, or a local Ollama instance.
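The core of an LLM-as-judge pipeline is a rubric prompt sent to a judge model and a structured verdict parsed from its reply. The sketch below illustrates that shape under stated assumptions: the criteria, the prompt wording, and the JSON reply format are all hypothetical examples, not the tutorial's actual implementation, and the real judge call (to Anthropic, OpenAI, or an Ollama instance) is left out in favor of a canned reply.

```python
import json

# Hypothetical example criteria; a real team would write its own,
# covering its security concerns, API conventions, and scope-creep patterns.
CRITERIA = [
    "no hardcoded secrets or injection-prone string building",
    "follows the team's REST API conventions",
    "no scope creep: only the requested change is made",
]

def build_judge_prompt(task: str, response_code: str) -> str:
    """Assemble a rubric prompt asking the judge model for JSON scores."""
    rubric = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        f"Task: {task}\n\nGenerated code:\n{response_code}\n\n"
        f"Score the code 1-5 on each criterion:\n{rubric}\n"
        'Reply only as JSON: {"scores": [...], "verdict": "pass" or "fail"}'
    )

def parse_verdict(judge_reply: str) -> dict:
    """Parse the judge's JSON reply and attach a mean score."""
    data = json.loads(judge_reply)
    data["mean_score"] = sum(data["scores"]) / len(data["scores"])
    return data

# In a real pipeline the prompt from build_judge_prompt() would be sent
# to the judge model; here we parse a canned reply to show the flow.
reply = '{"scores": [5, 4, 3], "verdict": "pass"}'
result = parse_verdict(reply)
print(result["mean_score"])  # 4.0
```

Logging these per-criterion scores for every response is what makes the later model comparison possible: after a few weeks, the accumulated verdicts show which model fails which criterion most often.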