CARE: Confounder-Aware Aggregation for Reliable Large Language Model Evaluation
📄 Summary (translated from Chinese)
LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms have a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. In practice, however, LLM judges exhibit correlated errors arising from shared latent confounders such as verbosity, stylistic preferences, or training artifacts. As a result, standard aggregation rules (e.g., majority voting or averaging) yield limited gains and can even amplify systematic errors. To address this, the authors propose CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Through this explicit modeling, the method aims to improve the reliability and accuracy of evaluation.
📄 English Summary
CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
The standard paradigm for scalable evaluation employs LLM-as-a-judge ensembles, yet their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. In reality, LLM judges exhibit correlated errors due to shared latent confounders, such as verbosity, stylistic preferences, or training artifacts. This correlation can lead standard aggregation rules, like majority voting or averaging, to yield minimal gains or even exacerbate systematic errors. To tackle this issue, a new framework called CARE is introduced, which explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. By modeling these influences, CARE aims to enhance the reliability and accuracy of evaluations.
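The failure mode described above can be illustrated with a small simulation. The sketch below is not the paper's actual CARE formulation; it assumes a simple linear score model in which each judge's score loads on both a latent quality signal and a shared confounder (here, a hypothetical "verbosity" variable), and shows that plain averaging retains the shared bias while regressing out a confounder proxy before averaging recovers quality more faithfully.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_judges = 2000, 5

# Latent true quality and a shared confounder (e.g., response verbosity).
quality = rng.normal(size=n_items)
verbosity = rng.normal(size=n_items)

# Every judge's score loads on quality AND on the same confounder,
# so judge errors are correlated across the ensemble.
load_q = rng.uniform(0.8, 1.2, size=n_judges)
load_c = rng.uniform(0.5, 0.9, size=n_judges)  # shared verbosity bias
noise = rng.normal(scale=0.3, size=(n_items, n_judges))
scores = quality[:, None] * load_q + verbosity[:, None] * load_c + noise

# Naive aggregation: averaging across judges does NOT cancel the shared
# bias, because it is common to all judges rather than independent noise.
naive = scores.mean(axis=1)

# Confounder-aware aggregation (sketch): regress each judge's scores on
# an observable confounder proxy and average the residuals instead.
X = np.column_stack([np.ones(n_items), verbosity])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)  # per-judge fits
adjusted = (scores - X @ beta).mean(axis=1)

r_naive = np.corrcoef(naive, quality)[0, 1]
r_adjusted = np.corrcoef(adjusted, quality)[0, 1]
print(r_naive, r_adjusted)  # adjustment should raise the correlation
```

In this toy setup the adjusted aggregate correlates markedly better with true quality than the naive average, which is the gap a confounder-aware framework like CARE is designed to close; the real method infers the confounding structure rather than assuming an observed proxy.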