References Improve LLM Alignment in Non-Verifiable Domains

📄 Abstract

Against the backdrop of Reinforcement Learning with Verifiable Rewards (RLVR) proving highly effective on reasoning tasks, this work asks whether reference-guided LLM evaluators can serve as soft "verifiers" in non-verifiable domains that lack ground-truth verifiers, such as LLM alignment. The study designs evaluation protocols that augment LLM evaluators with reference outputs. Experiments show that the reference-guided approach substantially improves the accuracy of weaker LLM judges when they are given references from frontier models, and that stronger LLM judges also benefit from high-quality (i.e., human-written) references.

📄 English Summary

References Improve LLM Alignment in Non-Verifiable Domains

This research investigates whether reference-guided LLM evaluators can act as soft "verifiers" in non-verifiable domains, such as LLM alignment, where Reinforcement Learning with Verifiable Rewards (RLVR) cannot be directly applied. Evaluation protocols are designed to enhance LLM-based evaluators with reference outputs. Comprehensive experiments demonstrate that the reference-guided approach significantly improves the accuracy of less capable LLM judges when using references from frontier models, and that stronger LLM judges also benefit from high-quality (i.e., human-written) references.
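The protocol summarized above is straightforward to prototype: the judge model sees the instruction, the candidate response, and a reference output, and returns a score that can serve as a soft reward. The sketch below is a minimal illustration under assumed names only; `JUDGE_PROMPT`, `reference_guided_score`, and `call_llm` are hypothetical stand-ins and do not reproduce the paper's actual prompts or scoring rubric.

```python
# Minimal sketch of a reference-guided LLM judge (hypothetical names throughout).
# `call_llm` is a stand-in for whatever chat-completion client you use.

from typing import Callable

JUDGE_PROMPT = """You are grading a response to a user instruction.
A reference answer is provided to guide (not dictate) your judgment.

Instruction:
{instruction}

Reference answer:
{reference}

Candidate response:
{candidate}

Rate the candidate on a 1-10 scale for how well it satisfies the
instruction, using the reference as a quality anchor. Reply with the
number only."""


def reference_guided_score(
    instruction: str,
    candidate: str,
    reference: str,
    call_llm: Callable[[str], str],
) -> float:
    """Score `candidate` with an LLM judge that also sees a reference output."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, reference=reference, candidate=candidate
    )
    raw = call_llm(prompt)
    try:
        return float(raw.strip())
    except ValueError:
        # Fall back to a neutral score if the judge's reply is not a bare number.
        return 5.0


if __name__ == "__main__":
    # Toy stand-in for a judge model; replace with a real chat-completion call.
    fake_judge = lambda prompt: "8"
    print(reference_guided_score(
        instruction="Explain what RLVR is in one sentence.",
        candidate="RLVR is reinforcement learning that uses a programmatic checker as the reward.",
        reference="RLVR trains models with rewards from automatic verifiers (e.g., unit tests).",
        call_llm=fake_judge,
    ))
```

In a reference-free setup the same judge would simply omit the reference block from the prompt; the paper's finding is that including the reference is what lifts the accuracy of weaker judges.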
