One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
📄 Summary
Reward models (RMs) play a crucial role in the online alignment of language models (LMs) with human preferences. However, preference tuning based on RMs is susceptible to reward hacking, in which LM policies learn undesirable behaviors from flawed RMs. Systematically measuring biases in five high-quality RMs, including state-of-the-art models, shows that issues with length, sycophancy, and overconfidence persist despite prior mitigation work, and uncovers new biases tied to model-specific styles and answer order. RM failures are categorized by complexity, and a simple post-hoc intervention is proposed to mitigate low-complexity biases that arise from spurious correlations. The proposed mechanistic reward shaping reduces the targeted biases without introducing new ones.
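The summary does not spell out how the post-hoc intervention works, but a low-complexity spurious correlation such as length bias can, in principle, be removed after scoring by regressing the bias feature out of the reward. The sketch below is a minimal illustration under that assumption; the names `debias_rewards`, `raw_rewards`, and `lengths` are hypothetical and are not taken from the paper.

```python
import numpy as np

def debias_rewards(raw_rewards: np.ndarray, bias_feature: np.ndarray) -> np.ndarray:
    """Post-hoc reward shaping sketch: remove the component of the reward
    that is linearly predictable from a spurious feature (e.g., response length).

    Illustrative assumption only; not the paper's exact procedure.
    """
    # Center both the feature and the reward before fitting.
    x = bias_feature - bias_feature.mean()
    y = raw_rewards - raw_rewards.mean()
    # One-dimensional least-squares slope of reward on the spurious feature.
    slope = (x @ y) / (x @ x + 1e-8)
    # Subtract the linearly predictable part; the mean reward is preserved
    # because the centered feature has zero mean.
    return raw_rewards - slope * x

# Hypothetical usage: RM scores for a batch of responses and their token lengths.
raw_rewards = np.array([0.2, 0.9, 0.4, 1.1])
lengths = np.array([120, 480, 150, 510], dtype=float)
shaped_rewards = debias_rewards(raw_rewards, lengths)
print(shaped_rewards)
```

Because the correction is a single linear term applied after scoring, it targets only the measured spurious correlation and leaves the rest of the reward signal unchanged, which is consistent with the claim that the shaping does not introduce new biases.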