ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

📄 Summary

Effective information seeking in multi-turn medical dialogues is crucial for accurate diagnosis, particularly when information is incomplete. The uncertainty inherent in user-agent interactions makes aligning Large Language Models (LLMs) difficult; we formalize the dialogue process as a Hierarchical Markov Decision Process (H-MDP). Conventional Reinforcement Learning (RL) methods such as Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment, while Proximal Policy Optimization (PPO) suffers from unstable value estimation in this setting. To address these issues, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm that adaptively allocates the rollout budget to improve decision-making efficiency and accuracy in multi-turn dialogues.
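The summary does not specify how ATPO measures uncertainty or splits the budget, so the following is only a minimal sketch of the general idea: give each tree node a share of a fixed rollout budget in proportion to the policy's entropy at that node, so uncertain dialogue states get explored more. The entropy signal, the proportional rule, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy of the policy's action distribution at a state."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)

def allocate_rollouts(states, total_budget, min_rollouts=1):
    """Split a fixed rollout budget across tree nodes in proportion to
    policy uncertainty: high-entropy (uncertain) states get more rollouts.
    Every state keeps at least `min_rollouts` so no branch is starved."""
    entropies = [policy_entropy(s) for s in states]
    spare = total_budget - min_rollouts * len(states)
    total_h = sum(entropies) or 1.0
    budgets = [min_rollouts + round(spare * h / total_h) for h in entropies]
    # Fix rounding drift so the budgets sum exactly to total_budget.
    budgets[0] += total_budget - sum(budgets)
    return budgets

# Example: three dialogue states with different action distributions.
states = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic: little to explore
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: explore heavily
    [0.60, 0.30, 0.05, 0.05],  # moderately uncertain
]
print(allocate_rollouts(states, total_budget=16))  # → [2, 8, 6]
```

With a budget of 16 rollouts, the uniform (most uncertain) state receives half the budget, while the near-deterministic state gets only the floor allocation.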
