Beyond Accuracy: Evaluating Visual Grounding in Multimodal Medical Reasoning

📄 Summary

Recent research indicates that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical visual question answering (VQA) benchmarks, suggesting that current evaluation protocols may fail to measure causal visual dependence. The paper proposes a counterfactual evaluation framework that presents each question with real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, the study measures Visual Reliance Score (VRS) and Image Sensitivity (IS), and introduces the Hallucinated Visual Reasoning Rate (HVRR) to detect cases where a model makes visual claims while producing image-invariant answers. The findings show that RLVR improves accuracy but degrades visual grounding, with text-only RLVR, whose training involves no images at all, achieving the best accuracy.
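The summary names the metrics but does not give their formal definitions, so the sketch below is a minimal illustration under assumed formulations: VRS as the relative accuracy drop from real to blank images, IS as the fraction of items whose answer changes under image perturbation, and HVRR as the share of image-invariant answers whose rationale still asserts visual evidence. The `AnswerFn` interface, the `Item` fields, and the keyword-based `mentions_visual_evidence` check are hypothetical stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical model interface: (question, image) -> (answer, rationale).
AnswerFn = Callable[[str, object], tuple]

@dataclass
class Item:
    question: str
    real_image: object
    blank_image: object      # e.g. an all-black image of the same size
    shuffled_image: object   # e.g. a patch-shuffled copy of real_image
    gold_answer: str

def mentions_visual_evidence(rationale: str) -> bool:
    # Crude keyword proxy for "the rationale makes a visual claim";
    # the paper's actual detector is not specified in this summary.
    cues = ("image shows", "in the image", "visible", "seen in", "the scan")
    r = rationale.lower()
    return any(c in r for c in cues)

def counterfactual_eval(model: AnswerFn, items: Sequence[Item]) -> dict:
    n = len(items)
    acc = {"real": 0, "blank": 0, "shuffled": 0}
    image_sensitive = 0   # answer changes when the image is perturbed
    invariant = 0         # answer identical across all three conditions
    hallucinated = 0      # image-invariant answer + visual claim in rationale
    for it in items:
        a_real, r_real = model(it.question, it.real_image)
        a_blank, _ = model(it.question, it.blank_image)
        a_shuf, _ = model(it.question, it.shuffled_image)
        acc["real"] += a_real == it.gold_answer
        acc["blank"] += a_blank == it.gold_answer
        acc["shuffled"] += a_shuf == it.gold_answer
        if a_real != a_blank or a_real != a_shuf:
            image_sensitive += 1
        else:
            invariant += 1
            if mentions_visual_evidence(r_real):
                hallucinated += 1
    return {
        "acc_real": acc["real"] / n,
        "acc_blank": acc["blank"] / n,
        "acc_shuffled": acc["shuffled"] / n,
        # Assumed VRS: relative accuracy drop when the image is blanked.
        "VRS": (acc["real"] - acc["blank"]) / max(acc["real"], 1),
        # Assumed IS: fraction of items whose answer changes under perturbation.
        "IS": image_sensitive / n,
        # Assumed HVRR: among image-invariant answers, share with visual claims.
        "HVRR": hallucinated / max(invariant, 1),
    }
```

Under these assumed definitions, high accuracy combined with low VRS and high HVRR would indicate exactly the failure mode the paper describes: correct answers accompanied by visual rationales that do not actually depend on the image.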
