📄 Summary
Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment
Large Language Models (LLMs) exhibit sophisticated general-purpose capabilities but often struggle to align with diverse individual preferences. Standard post-training methods such as Reinforcement Learning from Human Feedback (RLHF) typically optimize for a single global objective. Group Relative Policy Optimization (GRPO), a widely used online reinforcement learning framework, inherits this limitation in personalized settings: its group-based normalization implicitly assumes that all samples are exchangeable. This assumption conflates distinct user reward distributions, systematically biasing learning toward dominant preferences while suppressing minority signals. To address this issue, Personalized Group Relative Policy Optimization (P-GRPO) is introduced, aiming to better handle heterogeneous user preferences and enhance the model's personalization capabilities.
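To make the contrast concrete, below is a minimal sketch of group-relative advantage estimation, assuming that the personalized variant normalizes rewards within per-user subgroups rather than across the pooled group. The summary does not give the paper's exact formulation, so the function names, the per-user grouping rule, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO-style advantages: normalize against the whole sampled group.

    All samples are treated as exchangeable, so rewards drawn from different
    users' reward distributions are pooled into a single mean/std estimate.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def per_user_advantages(rewards, user_ids):
    """Hypothetical personalized variant: normalize within each user's subgroup.

    Illustrates the idea attributed to P-GRPO in the summary; the paper's
    actual formulation may differ.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    user_ids = np.asarray(user_ids)
    adv = np.empty_like(rewards)
    for uid in np.unique(user_ids):
        mask = user_ids == uid
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return adv

# Toy batch: user A (majority) has a high-reward distribution,
# user B (minority) has a low-reward distribution.
rewards  = [0.9, 0.8, 0.85, 0.2, 0.1]
user_ids = ["A", "A", "A", "B", "B"]

print(grpo_advantages(rewards))                # pooled normalization
print(per_user_advantages(rewards, user_ids))  # per-user normalization
```

On this toy batch, pooled normalization assigns large negative advantages to both of user B's samples regardless of their relative quality, effectively suppressing the minority signal; per-user normalization instead preserves the within-user ranking, so B's better response still receives a positive advantage.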