Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Source: Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Published: March 2, 2026

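📄 Chinese Summary

Training large language models to reason with search engines through reinforcement learning runs into a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after the entire multi-step trajectory, which makes it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods such as StepSearch mitigate this by introducing step-level supervision, but they rely on heuristic rewards such as TF-IDF overlap with gold documents and still sample k complete trajectories per example, which leads to high gradient variance. The SLATE framework rests on two complementary ideas: truncated step-level sampling, which generates k trajectories that share common parts, and process rewards that optimize the reasoning process, improving the model's reasoning ability and stability.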

🏷️ Related Tags

#step-level-sampling #process-rewards #retrieval-augmented-reasoning #credit-assignment #reinforcement-learning

📄 English Summary

Training large language models to reason with search engines via reinforcement learning faces a fundamental credit assignment problem: existing methods like Search-R1 provide only sparse outcome rewards after an entire multi-step trajectory, making it challenging to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods such as StepSearch alleviate this issue by introducing step-level supervision, but they rely on heuristic rewards like TF-IDF overlap with gold documents and still sample k complete trajectories per example, leading to high gradient variance. The proposed SLATE framework is built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories sharing common elements, and (2) process rewards that optimize the reasoning process. Together these are intended to improve the model's reasoning capabilities and training stability.
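To make the credit assignment contrast concrete, here is a minimal Python sketch. It is not the SLATE implementation: the summary does not say how advantages are computed, so the sketch assumes a GRPO-style group normalization (subtract the group mean, divide by the group standard deviation). All function names and reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampling group.

    Assumption: GRPO-style normalization; the source summary does not
    specify the estimator that SLATE actually uses.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def outcome_level_credit(trajectory_lengths, outcome_rewards):
    """Sparse outcome reward: k complete trajectories, one reward each.

    Every step in a trajectory inherits the same advantage, so a good
    search query and a bad one in the same rollout are indistinguishable.
    """
    advantages = group_advantages(outcome_rewards)
    return [[a] * n for a, n in zip(advantages, trajectory_lengths)]


def step_level_credit(step_rewards):
    """Step-level sampling: k candidates for a single next step.

    The candidates are assumed to share everything that came before,
    so each process reward is compared only against its siblings and
    the resulting advantage is attributed to that one step.
    """
    return group_advantages(step_rewards)


if __name__ == "__main__":
    # Two full rollouts of 3 and 2 steps; only the first one succeeded.
    print(outcome_level_credit([3, 2], [1.0, 0.0]))
    # Four candidate next steps with hypothetical process rewards.
    print(step_level_credit([0.9, 0.2, 0.4, 0.5]))
```

In the outcome-level case every step of a rollout receives an identical signal, while in the step-level case the signal isolates one decision. This is the intuition behind the variance argument in the summary, under the assumptions stated above.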
