Beyond Accuracy: Evaluating Visual Grounding in Multimodal Medical Reasoning

📄 Summary

Recent research indicates that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical visual question answering (VQA) benchmarks, suggesting that current evaluation protocols may fail to measure causal visual dependence. The paper proposes a counterfactual evaluation framework that presents each question with real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, the study measures Visual Reliance Score (VRS) and Image Sensitivity (IS), and introduces the Hallucinated Visual Reasoning Rate (HVRR) to detect cases where a model makes visual claims while producing image-invariant answers. The findings show that RLVR improves accuracy but degrades visual grounding, with text-only RLVR, whose training involves no images at all, achieving the best accuracy.
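The summary names the metrics but does not give their formal definitions, so the sketch below is a minimal illustration under assumed formulations: VRS as the relative accuracy drop from real to blank images, IS as the fraction of items whose answer changes under image perturbation, and HVRR as the share of image-invariant answers whose rationale still asserts visual evidence. The `AnswerFn` interface, the `Item` fields, and the keyword-based `mentions_visual_evidence` check are hypothetical stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical model interface: (question, image) -> (answer, rationale).
AnswerFn = Callable[[str, object], tuple]

@dataclass
class Item:
    question: str
    real_image: object
    blank_image: object      # e.g. an all-black image of the same size
    shuffled_image: object   # e.g. a patch-shuffled copy of real_image
    gold_answer: str

def mentions_visual_evidence(rationale: str) -> bool:
    # Crude keyword proxy for "the rationale makes a visual claim";
    # the paper's actual detector is not specified in this summary.
    cues = ("image shows", "in the image", "visible", "seen in", "the scan")
    r = rationale.lower()
    return any(c in r for c in cues)

def counterfactual_eval(model: AnswerFn, items: Sequence[Item]) -> dict:
    n = len(items)
    acc = {"real": 0, "blank": 0, "shuffled": 0}
    image_sensitive = 0   # answer changes when the image is perturbed
    invariant = 0         # answer identical across all three conditions
    hallucinated = 0      # image-invariant answer + visual claim in rationale
    for it in items:
        a_real, r_real = model(it.question, it.real_image)
        a_blank, _ = model(it.question, it.blank_image)
        a_shuf, _ = model(it.question, it.shuffled_image)
        acc["real"] += a_real == it.gold_answer
        acc["blank"] += a_blank == it.gold_answer
        acc["shuffled"] += a_shuf == it.gold_answer
        if a_real != a_blank or a_real != a_shuf:
            image_sensitive += 1
        else:
            invariant += 1
            if mentions_visual_evidence(r_real):
                hallucinated += 1
    return {
        "acc_real": acc["real"] / n,
        "acc_blank": acc["blank"] / n,
        "acc_shuffled": acc["shuffled"] / n,
        # Assumed VRS: relative accuracy drop when the image is blanked.
        "VRS": (acc["real"] - acc["blank"]) / max(acc["real"], 1),
        # Assumed IS: fraction of items whose answer changes under perturbation.
        "IS": image_sensitive / n,
        # Assumed HVRR: among image-invariant answers, share with visual claims.
        "HVRR": hallucinated / max(invariant, 1),
    }
```

Under these assumed definitions, high accuracy combined with low VRS and high HVRR would indicate exactly the failure mode the paper describes: correct answers accompanied by visual rationales that do not actually depend on the image.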
