📄 Chinese Abstract
Medical image report generation models face challenges when confronted with diverse reporting schemes. UniRG (Unified Report Generation) proposes an innovative multimodal reinforcement learning framework aimed at substantially improving the performance and generalization of medical vision-language models. At its core, the framework models report generation as a sequential decision-making problem and uses reinforcement learning feedback to optimize model outputs. Specifically, UniRG designs a multimodal reward function that jointly accounts for a report's medical accuracy, fluency, completeness, and agreement with the image content. The reward combines hard constraints based on structured medical knowledge with soft semantic-matching scores based on large-scale text corpora. Architecturally, UniRG adopts an encoder-decoder structure: the encoder extracts multi-scale, multimodal visual features from medical images and incorporates patient clinical information from electronic health records (EHR) as auxiliary input; the decoder is a Transformer-based autoregressive model that generates structured or unstructured report text. During training, UniRG is first pre-trained to learn basic vision-language correspondences and then fine-tuned with policy-gradient reinforcement learning methods (e.g., REINFORCE or Actor-Critic). This staged training strategy lets the model adapt to the reporting styles and terminology specific to different medical institutions and disease domains. Experiments show that UniRG significantly outperforms existing baselines on multiple public medical image report datasets, including chest X-ray and CT scans, with clear advantages in generating high-quality, clinically relevant reports. The generated reports score well on automated metrics such as BLEU, ROUGE, and CIDEr, and were rated highly for clinical utility and accuracy in blinded evaluations by radiologists. UniRG's success lies in effectively exploiting multimodal information and iteratively optimizing through reinforcement learning, overcoming the limitations of traditional supervised learning when handling open-domain, diverse reporting schemes, and offering a scalable, robust solution for automated medical image report generation.
📄 English Summary
UniRG: Scaling medical imaging report generation with multimodal reinforcement learning
Medical image report generation models face significant challenges in adapting to diverse reporting schemes. UniRG (Unified Report Generation) introduces an innovative multimodal reinforcement learning framework designed to substantially enhance the performance and generalization capabilities of medical vision-language models. The core of this framework lies in modeling report generation as a sequential decision-making problem and using reinforcement learning's feedback mechanisms to optimize model outputs. Specifically, UniRG introduces a multimodal reward function that comprehensively considers the generated reports' medical accuracy, fluency, completeness, and alignment with the image content. This reward function integrates hard constraints derived from structured medical knowledge with soft semantic-matching scores based on large-scale text corpora. Architecturally, UniRG employs an encoder-decoder structure: the encoder extracts multi-scale, multimodal visual features from medical images and incorporates patient clinical information from Electronic Health Records (EHR) as auxiliary input, while the decoder is a Transformer-based autoregressive model that generates structured or unstructured medical report text. During training, UniRG first undergoes pre-training to learn fundamental vision-language correspondences, followed by reinforcement learning fine-tuning using policy-gradient methods (e.g., REINFORCE or Actor-Critic). This phased training strategy allows the model to better adapt to the reporting styles and terminology specific to different medical institutions and disease domains. Experimental results demonstrate that UniRG achieves significantly superior performance compared to existing baseline models across multiple public medical image reporting datasets, including chest X-rays and CT scans, particularly excelling in generating high-quality, clinically relevant reports.
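The summary describes the reward as a mix of hard constraints from structured medical knowledge and soft corpus-based semantic matching, but does not publish the implementation. A minimal sketch of that general shape, using a hypothetical per-case checklist of required findings for the hard component and token overlap as a cheap stand-in for the learned semantic score:

```python
def hard_constraint_score(report, required_findings):
    """Fraction of required findings mentioned in the report.
    `required_findings` is a hypothetical per-case checklist standing in
    for the paper's structured-medical-knowledge constraints."""
    if not required_findings:
        return 1.0
    mentioned = {f for f in required_findings if f in report.lower()}
    return len(mentioned) / len(required_findings)

def soft_semantic_score(report, reference):
    """Token-level Jaccard overlap: a simple stand-in for the
    corpus-based semantic-matching score the summary describes."""
    a, b = set(report.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def composite_reward(report, reference, required_findings, alpha=0.5):
    """Weighted mix of hard and soft components; `alpha` is a
    hypothetical trade-off hyperparameter, not taken from the paper."""
    return (alpha * hard_constraint_score(report, required_findings)
            + (1 - alpha) * soft_semantic_score(report, reference))
```

In a real system the soft term would come from a learned similarity model and the hard term from an ontology-aware parser; the weighted-sum structure is the part the summary actually specifies.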
Its generated reports perform remarkably well on automated metrics such as BLEU, ROUGE, and CIDEr, and received high ratings for clinical utility and accuracy in blinded evaluations by radiologists. UniRG's success stems from its ability to effectively leverage multimodal information and to optimize iteratively through reinforcement learning. This overcomes the limitations of traditional supervised learning methods in handling open-domain, diverse reporting schemes, providing a scalable and robust solution for future automated medical image report generation.
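The policy-gradient fine-tuning stage the summary mentions reduces to scoring sampled outputs with the reward and weighting their log-likelihood gradients by a baseline-adjusted advantage. A toy single-step REINFORCE-with-baseline sketch over a categorical "policy" (illustrative only, not the paper's implementation; the mean sampled reward serves as the baseline, as in self-critical and actor-critic variants):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.1, n_samples=32, rng=random):
    """One REINFORCE update on a toy categorical policy.
    Samples actions, scores them with `reward_fn`, and ascends the
    policy-gradient estimate with the mean reward as a baseline."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    rewards = [reward_fn(a) for a in actions]
    baseline = sum(rewards) / len(rewards)
    grads = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        adv = r - baseline
        for i, p in enumerate(probs):
            # d(log pi(a)) / d(logit_i) = 1{i == a} - p_i
            grads[i] += adv * ((1.0 if i == a else 0.0) - p)
    return [x + lr * g / n_samples for x, g in zip(logits, grads)]
```

In the full system each "action" would be a generated report token and the reward would be the composite report-level score, but the update rule keeps this shape: actions scoring above the baseline have their probability pushed up, the rest pushed down.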