Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

📄 Abstract

Training large language models to reason with search engines via reinforcement learning faces a fundamental credit-assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods such as StepSearch mitigate this by introducing step-level supervision, but they rely on heuristic rewards such as TF-IDF overlap with gold documents and still sample k complete trajectories per example, incurring high gradient variance. The SLATE framework builds on two complementary ideas: truncated step-level sampling, which generates the k samples from a shared trajectory prefix, and process rewards that optimize the reasoning process, improving both the model's reasoning ability and training stability.

📄 English Summary

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Training large language models to reason with search engines via reinforcement learning faces a fundamental credit-assignment problem: existing methods like Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods such as StepSearch alleviate this issue by introducing step-level supervision, but they rely on heuristic rewards like TF-IDF overlap with gold documents and still sample k complete trajectories per example, leading to high gradient variance. The proposed SLATE framework is built on two complementary ideas: truncated step-level sampling, which generates the k samples from a shared trajectory prefix rather than as independent full rollouts, and process rewards that directly optimize the reasoning process, thereby enhancing the model's reasoning capability and training stability.
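The contrast between the two sampling schemes can be illustrated with a toy sketch. This is a minimal illustration under assumptions, not SLATE's actual algorithm: `sample_step` stands in for an LLM policy decoding one reasoning/search step, and `process_reward` is a stand-in for a per-step signal such as StepSearch's TF-IDF overlap with gold documents; all names here are hypothetical.

```python
import random

def sample_step(prefix, rng):
    # Toy "policy": a step is just a random token. A real system would
    # decode a reasoning step or a search query from an LLM given the prefix.
    return f"step{rng.randint(0, 9)}"

def process_reward(step, gold="step7"):
    # Toy per-step process reward (stand-in for e.g. TF-IDF overlap with
    # gold documents): 1.0 if the step matches the gold step, else 0.0.
    return 1.0 if step == gold else 0.0

def full_trajectory_sampling(k, horizon, rng):
    # Baseline (Search-R1 / StepSearch style): k independent complete
    # rollouts per example; the outcome reward arrives only at the end,
    # so credit for individual steps is diffuse and variance is high.
    trajectories = [[sample_step([], rng) for _ in range(horizon)]
                    for _ in range(k)]
    outcome_rewards = [float(traj[-1] == "step7") for traj in trajectories]
    return trajectories, outcome_rewards

def truncated_step_level_sampling(k, horizon, rng):
    # SLATE-style sketch (assumed reading of "k trajectories from a shared
    # prefix"): at each step, branch k candidate next steps off the prefix
    # built so far, score each with the process reward, and extend the
    # shared prefix with the best-scoring candidate.
    prefix, step_rewards = [], []
    for _ in range(horizon):
        candidates = [sample_step(prefix, rng) for _ in range(k)]
        rewards = [process_reward(c) for c in candidates]
        best = max(range(k), key=lambda i: rewards[i])
        prefix.append(candidates[best])
        step_rewards.append(rewards[best])
    return prefix, step_rewards
```

Because the k candidates at every step share the same prefix, their reward differences reflect only the current decision, which is the intuition behind the lower gradient variance claimed over k fully independent rollouts.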


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.