Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation

📄 Chinese Summary

agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation, achieves κ = 0.68 on FaithBench (faithfulness) and 91–100% PCAcc on JudgeBench (pairwise comparison), on par with heavyweight frameworks that require more than 40 dependencies. Manual review does not scale, so using an LLM as the judge is a practical alternative; existing frameworks such as DeepEval and Ragas depend on torch, transformers, langchain, and other packages, which limits their flexibility and usability.

📄 English Summary

Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation

agent-eval-lite is a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ = 0.68 on FaithBench (faithfulness) and 91–100% PCAcc on JudgeBench (pairwise comparison), competitive with heavyweight frameworks that require over 40 dependencies. Manual review does not scale, and using an LLM as the judge is a practical alternative; however, existing frameworks such as DeepEval and Ragas depend on torch, transformers, langchain, and other packages, which limits their flexibility and usability.
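The two reported metrics can be sketched in zero-dependency Python. This is an illustrative sketch, not agent-eval-lite's actual API: the function names and label encodings are assumptions, and it assumes the reported κ is a Cohen-style chance-corrected agreement statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two label sequences
    # (e.g. judge labels vs. human faithfulness labels); works
    # for binary or multi-class labels.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled at random with
    # their observed marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def pairwise_accuracy(judge_prefs, gold_prefs):
    # PCAcc: fraction of response pairs where the judge picks the
    # same winner (e.g. "A" or "B") as the gold preference.
    correct = sum(j == g for j, g in zip(judge_prefs, gold_prefs))
    return correct / len(gold_prefs)
```

Note that κ discounts agreement that would occur by chance, so a judge that always outputs the majority label scores near zero even when its raw accuracy looks high; under the commonly used Landis–Koch bands, κ = 0.68 falls in the "substantial agreement" range.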

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.