Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

📄 English Summary

Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

Multi-turn human-AI collaboration is essential for deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions with reinforcement learning is hampered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these issues, the paper proposes Implicit Turn-wise Policy Optimization (ITPO), which employs an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. These turn-level signals are more robust than volatile token-level rewards, and a normalization mechanism further stabilizes training. Evaluations show that ITPO is effective at optimizing proactive human-AI interactions.
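The idea of deriving turn-level process rewards from an implicitly trained reward model can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes (as in common implicit-PRM formulations) that the per-turn reward is the scaled log-probability ratio between the policy and a reference model over that turn's tokens, followed by whitening for stability. All function names and the example numbers are illustrative.

```python
import math

def implicit_turn_rewards(policy_logps, ref_logps, beta=0.1):
    """Derive a process reward per dialogue turn from summed log-probs.

    Implicit-PRM sketch: the reward credited to turn t is
    beta * (log pi(turn_t) - log pi_ref(turn_t)), so a model trained
    only on sparse outcome signals implicitly defines dense,
    turn-level rewards.
    """
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

def normalize(rewards, eps=1e-8):
    """Whiten turn-level rewards (zero mean, unit variance) to
    stabilize policy-gradient updates."""
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    return [(x - mean) / (math.sqrt(var) + eps) for x in rewards]

# Hypothetical 3-turn dialogue: summed log-probs of each assistant
# turn under the current policy and under the frozen reference model.
policy_logps = [-12.0, -8.5, -15.2]
ref_logps = [-13.1, -8.0, -16.0]

raw = implicit_turn_rewards(policy_logps, ref_logps)
adv = normalize(raw)
```

Because the signals are aggregated per turn rather than per token, a single noisy token probability has less influence on the credit assigned to a turn, which is the robustness property the summary refers to.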
