Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

📄 Summary

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of Multi-Modal Large Language Models (MLLMs). However, during RL training, the vast state space of MLLMs and the sparsity of verifiable rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This calls for an exploration strategy that maintains productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. CalibRL is a hybrid-policy RLVR framework that supports controllable exploration under expert guidance, enabled by two key mechanisms. First, it employs distribution-aware advantage-weighted scaling to make the exploration process more efficient.
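The summary names only this first mechanism and gives no formula for it. As a rough illustration of the general idea, here is a minimal PyTorch sketch of what distribution-aware advantage weighting could look like on top of GRPO-style group-normalized advantages: rollouts whose likelihood drifts far from an expert (guidance) distribution are down-weighted. The exponential weighting, the `alpha` hyperparameter, and all function names are assumptions made for illustration, not CalibRL's actual formulation.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantages: normalize the verifiable 0/1 rewards within
    # a group of rollouts sampled for the same prompt.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def distribution_aware_weights(logp_policy: torch.Tensor,
                               logp_expert: torch.Tensor,
                               alpha: float = 0.1) -> torch.Tensor:
    # Hypothetical weighting (an assumption, not the paper's definition):
    # discount rollouts whose likelihood under the current policy drifts
    # far from the expert distribution, keeping exploration anchored to
    # expert guidance. alpha sets how sharply off-distribution rollouts
    # are down-weighted.
    gap = (logp_policy - logp_expert).abs()
    return torch.exp(-alpha * gap)

# Toy usage: 4 rollouts for one prompt with binary verifiable rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logp_policy = torch.tensor([-12.3, -25.1, -18.7, -11.9])  # sequence log-probs
logp_expert = torch.tensor([-13.0, -14.2, -19.5, -12.4])

advantages = group_normalized_advantages(rewards)
weighted = distribution_aware_weights(logp_policy, logp_expert) * advantages
# `weighted` would then scale the per-rollout policy-gradient loss.
print(weighted)
```

Under this reading, the weighting leaves on-distribution rollouts almost untouched while shrinking the gradient contribution of rollouts the expert would consider unlikely, which is one plausible way to keep stochastic exploration productive rather than random.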
