Preparing Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

📄 Abstract

The reasoning capabilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for improving LLM performance. However, current RLVR methods typically train LLMs to solve problems independently, without explicitly preparing them to synthesize and benefit from different rationales. To bridge this gap, a new method called Self-Debate Reinforcement Learning (SDRL) is proposed, which strengthens an LLM's reasoning by simulating an internal debate process, better adapting it to multi-agent collaborative settings.

📄 English Summary

Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

The reasoning capabilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales. To bridge this gap, a novel approach called "Self-Debate Reinforcement Learning" (SDRL) is proposed, aiming to enhance LLM reasoning abilities by simulating an internal debate process, thereby better adapting models to multi-agent collaborative environments.

The core idea of SDRL is to have the LLM play multiple roles during training, generating diverse reasoning paths and solutions for the same problem and refining its final answer through internal evaluation and critique. This is analogous to how humans engage in self-reflection and multi-perspective thinking when facing complex problems.

Specifically, SDRL establishes a cyclical feedback mechanism: the LLM generates an initial answer with supporting arguments, then generates counter-arguments or alternative solutions, and finally synthesizes this information to arrive at a more robust and accurate conclusion. Each step in this process is evaluated and rewarded, progressively strengthening the model's capabilities in debate and synthesis. By introducing a Debate Reward, SDRL encourages the model to produce diverse and persuasive arguments and rebuttals, and rewards its ability to reach a higher-quality consensus after synthesizing different viewpoints.
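The propose → critique → synthesize loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a stand-in for an LLM call, `is_correct` stands in for the verifiable-reward checker, and the function and reward names are assumptions for illustration only.

```python
def generate(prompt: str) -> str:
    # Placeholder for an LLM call; returns a canned string here so the
    # sketch is self-contained and runnable.
    return f"<response to: {prompt}>"

def is_correct(answer: str, reference: str) -> bool:
    # Verifiable outcome check (e.g., exact match on a math answer).
    return answer.strip() == reference.strip()

def self_debate_rollout(problem: str, reference: str, num_rounds: int = 2):
    """Run one self-debate episode; return (final_answer, reward, transcript)."""
    transcript = []
    # Step 1: initial answer with supporting arguments (proposer role).
    proposal = generate(f"Solve and justify: {problem}")
    transcript.append(("proposer", proposal))
    for _ in range(num_rounds):
        # Step 2: the same model critiques its own answer (critic role).
        critique = generate(
            f"Find flaws or alternative solutions for: {proposal}\nProblem: {problem}"
        )
        transcript.append(("critic", critique))
        # Step 3: synthesize proposal and critique into a revised answer.
        proposal = generate(f"Given the critique '{critique}', revise: {proposal}")
        transcript.append(("synthesizer", proposal))
    # Verifiable outcome reward; a debate-shaping bonus for persuasive
    # critiques could be added here (assumed form, not given in the source).
    reward = 1.0 if is_correct(proposal, reference) else 0.0
    return proposal, reward, transcript
```

In an RLVR setup, the transcript would be scored per step so that both the outcome reward and the Debate Reward can back-propagate through the proposer, critic, and synthesizer roles of the single underlying model.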
