📄 English Summary
Hindsight Credit Assignment for Long-Horizon LLM Agents
Large Language Model (LLM) agents often face severe credit-assignment challenges in long-horizon, multi-step tasks with sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), suffer from two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, HCAPO is introduced as the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic, refining step-level Q-values through hindsight reasoning, and its multi-scale advantage mechanism compensates for the misaligned value baselines at critical decision states.
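To make the two bottlenecks concrete, here is a minimal sketch of the GRPO-style group-relative baseline the summary refers to, plus one plausible way a hindsight critic's per-step scores could be blended into step-level Q-values. The function names, the blending rule, and the `alpha` weight are illustrative assumptions for exposition, not HCAPO's actual formulation.

```python
from statistics import mean, pstdev

def grpo_advantages(returns, eps=1e-8):
    """GRPO-style advantage: normalize each trajectory's return
    against the mean and standard deviation of its sampled group.
    With only a sparse terminal reward, every step in a trajectory
    inherits the same advantage, which is the credit-assignment gap."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

def hindsight_refined_q(trajectory_reward, critic_scores, alpha=0.5):
    """Illustrative blend of the sparse trajectory-level reward with
    per-step credit scores produced in hindsight by a critic (here just
    floats in [0, 1]). `alpha` and the linear rule are assumptions."""
    return [alpha * trajectory_reward + (1 - alpha) * s
            for s in critic_scores]

# Four sampled trajectories with sparse terminal rewards.
group_returns = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(group_returns)

# Per-step credit for one successful trajectory, as scored after the
# fact by a (hypothetical) LLM hindsight critic.
step_qs = hindsight_refined_q(1.0, [0.2, 0.9, 0.4])
```

In the sketch, `grpo_advantages` gives all steps of a trajectory one shared signal, while `hindsight_refined_q` differentiates steps within the same trajectory, which is the kind of step-level refinement the summary attributes to HCAPO.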