📄 Summary
Alternating Reinforcement Learning with Contextual Rubric Rewards
Reinforcement Learning with Rubric Rewards (RLRR) extends conventional Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional contextual rubric evaluations. Existing RLRR approaches, however, are limited to linearly compressing the vector reward into a scalar using fixed weights, an aggregation that is sensitive to how the scores are manually designed and that fails to capture correlations among reward dimensions. To address these limitations in reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for fixed scalarization by optimizing a single semantic rubric, enabling a more flexible and effective reward mechanism.
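To make the aggregation issue concrete, here is a minimal, self-contained Python sketch. The first part mirrors the fixed-weight linear scalarization criticized above; the loop that follows only illustrates the general alternating structure implied by the name ARL-RR (alternate a policy update with an update of the reward aggregation). All names here (`rubric_scores`, `update_policy`, `refine_rubric_weights`) and the numeric weight-refinement rule are hypothetical stand-ins, not the paper's method; in particular, the summary states that ARL-RR optimizes a semantic rubric, which this numeric toy does not capture.

```python
import numpy as np

# Hypothetical rubric scores for a batch of three model responses:
# each row holds per-dimension scores, e.g. (accuracy, helpfulness, style).
rubric_scores = np.array([
    [0.9, 0.4, 0.7],
    [0.6, 0.8, 0.5],
    [0.3, 0.9, 0.8],
])

# --- Baseline RLRR aggregation: fixed linear scalarization ---
# A hand-chosen weight vector collapses the rubric into one scalar
# reward per response; the result hinges on these weights and ignores
# correlations between reward dimensions.
fixed_weights = np.array([0.5, 0.3, 0.2])
print("fixed scalar rewards:", rubric_scores @ fixed_weights)

# --- Alternating loop in the spirit of ARL-RR (illustrative only) ---
def update_policy(rewards):
    """Placeholder for one RL policy update against `rewards`."""
    return rewards.mean()  # stands in for a training-loss value

def refine_rubric_weights(weights, scores, lr=0.1):
    """Toy stand-in for the rubric-optimization step: shift weight
    toward dimensions that discriminate more between responses."""
    weights = weights + lr * scores.var(axis=0)
    return weights / weights.sum()  # keep a convex combination

weights = np.ones(3) / 3  # start from a uniform aggregation
for step in range(5):
    rewards = rubric_scores @ weights                         # aggregate rubric
    loss = update_policy(rewards)                             # (a) policy step
    weights = refine_rubric_weights(weights, rubric_scores)   # (b) rubric step
    print(f"step {step}: loss={loss:.3f}, weights={np.round(weights, 3)}")
```

In the actual method, step (b) would presumably revise the rubric semantically rather than re-weighting scores numerically; the loop above conveys only the alternating control flow that removes the dependence on a single fixed scalarization.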