Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation

📄 Chinese Summary

agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation, achieves κ = 0.68 on FaithBench (faithfulness) and 91–100% PCAcc on JudgeBench (pairwise comparison), on par with heavyweight frameworks that require more than 40 dependencies. Manual review does not scale, so using an LLM as the judge is a practical alternative; existing frameworks such as DeepEval and Ragas depend on torch, transformers, langchain, and other packages, which limits their flexibility and usability.

📄 English Summary

Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation

agent-eval-lite is a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ = 0.68 on FaithBench (faithfulness) and 91–100% PCAcc on JudgeBench (pairwise comparison), competitive with heavyweight frameworks that require over 40 dependencies. Manual review does not scale, and using an LLM as the judge is a practical alternative; however, existing frameworks such as DeepEval and Ragas depend on torch, transformers, langchain, and other packages, which limits their flexibility and usability.
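The two reported metrics can be sketched in zero-dependency Python. This is an illustrative sketch, not agent-eval-lite's actual API: the function names and label encodings are assumptions, and it assumes the reported κ is a Cohen-style chance-corrected agreement statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two label sequences
    # (e.g. judge labels vs. human faithfulness labels); works
    # for binary or multi-class labels.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled at random with
    # their observed marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def pairwise_accuracy(judge_prefs, gold_prefs):
    # PCAcc: fraction of response pairs where the judge picks the
    # same winner (e.g. "A" or "B") as the gold preference.
    correct = sum(j == g for j, g in zip(judge_prefs, gold_prefs))
    return correct / len(gold_prefs)
```

Note that κ discounts agreement that would occur by chance, so a judge that always outputs the majority label scores near zero even when its raw accuracy looks high; under the commonly used Landis–Koch bands, κ = 0.68 falls in the "substantial agreement" range.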

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.