Does Reinforcement Learning Really Improve the Reasoning Ability of Large Chat Models?

📄 Chinese Summary (translated)

New training methods claim to boost the intelligence of chat models, but do they actually change how a model thinks, or merely polish skills it already has? The study tested a method called RLVR, mainly on math, coding, and visual puzzles, and initial results showed improved performance when the model was given a single attempt. However, when the model was allowed multiple attempts, the base model actually performed better, showing that the gains from RLVR are quite limited. This suggests that the apparent extra skill largely comes from what the model has already learned rather than from new problem-solving ability; coverage and response diversity were also restricted, pointing to a clear ceiling. The study also considers distillation techniques as a separate angle.

📄 English Summary

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

New training methods claim to enhance the intelligence of chat models, but it remains unclear whether they genuinely alter a model's thinking process or merely refine existing skills. A method called RLVR was tested on mathematical, coding, and visual puzzles, and showed initial improvements when the model was given only one attempt. However, when the models were allowed multiple tries, the base model outperformed the RLVR-trained models, indicating that the improvements were quite limited. This suggests that the apparent gains stem largely from what the base model has already learned rather than from new problem-solving capabilities. The coverage and variety of responses were also restricted, revealing a ceiling effect. Distillation techniques are considered as a separate angle.
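The "multiple tries" comparison above is typically measured with a pass@k metric: sample n answers per problem, count the c correct ones, and estimate the chance that at least one of k sampled answers succeeds. As a minimal sketch (the standard unbiased estimator from the code-generation literature; the exact evaluation setup in the summarized paper is an assumption here), it can be computed as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per problem,
    c of which are correct, return the probability that at least
    one of k randomly drawn samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy illustration of the ceiling-effect argument: a model with a
# low per-attempt success rate still reaches a high pass@k for
# large k, which is why base models can catch up given many tries.
print(pass_at_k(100, 5, 1))   # single attempt: low
print(pass_at_k(100, 5, 64))  # many attempts: close to 1
```

The guard clause matters: computing `comb(n - c, k)` when `n - c < k` would return 0 anyway in Python, but the early return makes the boundary case explicit.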

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.