DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

📄 Summary

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. Recent efforts have extended this paradigm to broader scientific (STEM) domains, yet the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. Controlled experiments reveal a critical challenge: applying RL directly to base models in general STEM domains is highly sample-inefficient and is consistently outperformed by SFT on moderate-quality responses. However, running SFT and then RL sequentially improves performance further, suggesting that the two approaches are complementary when combined.
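The summary describes the recipe only at a high level: an SFT stage first, then an RL stage with verifiable rewards, with training examples scheduled by difficulty. The sketch below illustrates one plausible shape of such a pipeline. It is an assumption-laden illustration, not the paper's actual method: the names (ToyModel, estimate_difficulty, verify) and the design choices (failure-rate as the difficulty signal, easy-to-hard ordering) are all hypothetical stand-ins.

```python
"""Illustrative sketch of decoupled SFT-then-RL with a difficulty-aware
curriculum. All names and logic here are assumptions for illustration;
the summary does not specify DeReason's actual algorithm."""

import random
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    answer: str


def verify(response: str, answer: str) -> bool:
    # Placeholder verifiable-reward check (e.g., exact-match answer grading).
    return response.strip() == answer.strip()


class ToyModel:
    # Stand-in for an LLM policy; real code would wrap generation and
    # gradient updates on an actual model.
    def generate(self, prompt: str) -> str:
        return random.choice(["A", "B", "C"])

    def sft_update(self, ex: Example) -> None:
        pass  # supervised cross-entropy step on a reference response

    def rl_update(self, ex: Example, reward: float) -> None:
        pass  # policy-gradient step (e.g., PPO/GRPO) on the verifiable reward


def estimate_difficulty(model: ToyModel, ex: Example, n: int = 8) -> float:
    # Proxy difficulty: empirical failure rate under sampling (an assumption,
    # though a common signal in RLVR curricula).
    fails = sum(not verify(model.generate(ex.prompt), ex.answer) for _ in range(n))
    return fails / n


def train_decoupled(model: ToyModel, data: list[Example],
                    sft_epochs: int = 2, rl_steps: int = 100) -> None:
    # Stage 1: SFT on moderate-quality responses, which the summary reports
    # is far more sample-efficient than RL applied directly to the base model.
    for _ in range(sft_epochs):
        for ex in data:
            model.sft_update(ex)

    # Stage 2: RLVR, scheduled easy-to-hard by estimated difficulty
    # (one plausible reading of "difficulty-aware curriculum").
    curriculum = sorted(data, key=lambda ex: estimate_difficulty(model, ex))
    for step in range(rl_steps):
        ex = curriculum[step % len(curriculum)]
        reward = float(verify(model.generate(ex.prompt), ex.answer))
        model.rl_update(ex, reward)


if __name__ == "__main__":
    train_decoupled(ToyModel(), [Example("1+1=?", "2"), Example("2*3=?", "6")])
```

The decoupling is the point of the two separate stages: the curriculum ordering is computed only once the SFT-initialized model exists, so the RL stage starts from a policy that is already sample-efficient to train, consistent with the summary's finding that sequential SFT-then-RL outperforms RL from the base model.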
