📄 English Summary
Evaluating Reward Model Generalization via Pairwise Maximum Discrepancy Competitions
Reward models are central to aligning large language models, but their practical effectiveness hinges on generalization to unseen prompts and shifting distributions. Most existing reward model evaluations rely on static, pre-annotated preference datasets, which provide limited coverage and often fail to faithfully assess generalization in open-world settings. We introduce Pairwise Maximum Discrepancy Competition (PMDC), a dynamic and annotation-efficient evaluation framework designed to address these limitations.

At its core, PMDC systematically probes the weaknesses of reward models by having two reward models express preferences over the same prompt and identifying the sample pairs on which those preferences diverge most. Concretely, given a prompt, PMDC has the two reward models each rank two candidate responses, then selects the response pair exhibiting the maximum discrepancy between the models' relative preferences. This dynamic selection process enables PMDC to actively discover generalization blind spots in specific domains or for particular types of prompts, rather than passively testing on pre-established data. By iteratively generating and evaluating these maximally discrepant samples, PMDC effectively surfaces scenarios where reward model generalization fails, yielding more informative evaluation results.

Compared to traditional static evaluation, PMDC offers a more comprehensive and nuanced picture of how reward models perform on novel or complex inputs, providing crucial insights for reward model improvement and more reliable alignment. Furthermore, PMDC's design accounts for annotation cost: by focusing on samples where the models disagree most, it obtains more representative evaluation data with less annotation effort, which is critical for resource-constrained real-world applications.
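The maximum-discrepancy selection step described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the reward models here are toy stand-in scoring functions, and `select_max_discrepancy_pair` is an assumed helper name. A real setup would replace them with learned reward models returning a scalar score per (prompt, response).

```python
def select_max_discrepancy_pair(prompt, response_pairs, rm_a, rm_b):
    """Return the (response_1, response_2) pair on which the two reward
    models' relative preferences diverge the most, plus the gap size."""
    def relative_preference(rm, r1, r2):
        # Positive means this reward model prefers r1 over r2 for the prompt.
        return rm(prompt, r1) - rm(prompt, r2)

    best_pair, best_gap = None, float("-inf")
    for r1, r2 in response_pairs:
        # Discrepancy = how differently the two models rank this pair.
        gap = abs(relative_preference(rm_a, r1, r2)
                  - relative_preference(rm_b, r1, r2))
        if gap > best_gap:
            best_pair, best_gap = (r1, r2), gap
    return best_pair, best_gap


# Toy stand-in reward models (assumptions for illustration only):
rm_a = lambda p, r: len(r)           # "prefers" longer responses
rm_b = lambda p, r: -r.count("!")    # "penalizes" exclamation marks

pairs = [
    ("short.", "a much longer answer!!!"),
    ("ok", "ok!"),
    ("fine answer", "fine answer"),
]
pair, gap = select_max_discrepancy_pair("some prompt", pairs, rm_a, rm_b)
# The first pair is selected: rm_a strongly prefers the long response,
# while rm_b prefers the short one, so their relative preferences clash most.
```

The selected pair is exactly the kind of sample PMDC forwards for human annotation: since the two models disagree maximally, a single label is guaranteed to falsify one model's preference, which is why this sampling is annotation-efficient.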