Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

📄 Summary

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of Multi-Modal Large Language Models (MLLMs). However, during RL training, the vast state space of MLLMs and the sparsity of verifiable rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This calls for an exploration strategy that maintains productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. CalibRL is a hybrid-policy RLVR framework that supports controllable exploration under expert guidance, enabled by two key mechanisms. First, it employs distribution-aware advantage-weighted scaling to make the exploration process more efficient.
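The summary names only this first mechanism and gives no formula for it. As a rough illustration of the general idea, here is a minimal PyTorch sketch of what distribution-aware advantage weighting could look like on top of GRPO-style group-normalized advantages: rollouts whose likelihood drifts far from an expert (guidance) distribution are down-weighted. The exponential weighting, the `alpha` hyperparameter, and all function names are assumptions made for illustration, not CalibRL's actual formulation.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantages: normalize the verifiable 0/1 rewards within
    # a group of rollouts sampled for the same prompt.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def distribution_aware_weights(logp_policy: torch.Tensor,
                               logp_expert: torch.Tensor,
                               alpha: float = 0.1) -> torch.Tensor:
    # Hypothetical weighting (an assumption, not the paper's definition):
    # discount rollouts whose likelihood under the current policy drifts
    # far from the expert distribution, keeping exploration anchored to
    # expert guidance. alpha sets how sharply off-distribution rollouts
    # are down-weighted.
    gap = (logp_policy - logp_expert).abs()
    return torch.exp(-alpha * gap)

# Toy usage: 4 rollouts for one prompt with binary verifiable rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logp_policy = torch.tensor([-12.3, -25.1, -18.7, -11.9])  # sequence log-probs
logp_expert = torch.tensor([-13.0, -14.2, -19.5, -12.4])

advantages = group_normalized_advantages(rewards)
weighted = distribution_aware_weights(logp_policy, logp_expert) * advantages
# `weighted` would then scale the per-rollout policy-gradient loss.
print(weighted)
```

Under this reading, the weighting leaves on-distribution rollouts almost untouched while shrinking the gradient contribution of rollouts the expert would consider unlikely, which is one plausible way to keep stochastic exploration productive rather than random.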
