📄 Abstract
The integration of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) opens new opportunities for medical classification. To systematically evaluate these advanced techniques, this study constructs a rigorous, unified benchmark built on four publicly available datasets that span text and image modalities and cover both binary and multiclass classification tasks. The benchmark is designed to contrast traditional Machine Learning (ML) methods with modern transformer-based techniques on medical classification tasks. For each task, three classes of models are evaluated. First, traditional machine learning models, which typically rely on handcrafted feature engineering or shallow learning algorithms such as Support Vector Machines (SVMs), Random Forests, or Logistic Regression, and which retain advantages in interpretability and efficiency in certain domains. Second, transformer-based pre-trained models, such as BERT and RoBERTa for text tasks, or ViT and ResNet for image tasks; these models are typically pre-trained on large amounts of data and exhibit strong feature extraction and generalization capabilities. Third, multimodal fusion models, particularly approaches that combine VLMs and LLMs; these models aim to achieve more comprehensive and accurate classification by integrating information across modalities, for example by coupling visual features from medical images with linguistic descriptions from clinical text. By systematically comparing these model categories within a unified framework, the study seeks to reveal the strengths and weaknesses of each model class across different medical classification scenarios, thereby guiding researchers and clinicians in selecting the paradigm best suited to a given task. The results not only quantify the performance gains of foundation models over traditional ML but also probe the boundaries of their applicability across data modalities and task complexities, offering key insights for the future development of medical AI models.
📄 English Summary
LLM Is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for Text- and Image-Based Medical Classification
The integration of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) presents novel avenues for medical classification. To rigorously assess these advanced technologies, this work establishes a unified benchmark utilizing four publicly available datasets, encompassing both text and image modalities, and addressing both binary and multiclass classification complexities. The benchmark aims to contrast traditional Machine Learning (ML) methodologies with contemporary transformer-based techniques for medical classification tasks. For each task, three distinct model classes were evaluated. First, conventional machine learning models, which typically rely on handcrafted feature engineering or shallow learning algorithms such as Support Vector Machines (SVMs), Random Forests, or Logistic Regression, and which still offer advantages in interpretability and efficiency for specific domains. Second, transformer-based pre-trained models, such as BERT and RoBERTa for text tasks, or ViT and ResNet for image tasks; these models are usually pre-trained on vast amounts of data and demonstrate powerful feature extraction capabilities and generalization performance. Third, multimodal fusion models, particularly those combining VLM and LLM approaches; these models aim to achieve more comprehensive and accurate classification by integrating information from different modalities, for instance by combining visual features from medical images with linguistic descriptions from clinical text. Through a systematic comparison of these diverse model categories within a unified framework, the study seeks to elucidate the strengths and weaknesses of each model paradigm across various medical classification scenarios, thereby guiding researchers and clinicians in selecting the most appropriate model for specific tasks.
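The unified-framework comparison described above can be sketched as a small evaluation harness that scores every model class on the same held-out test set. This is a minimal illustrative sketch, not the paper's actual code: the function names (`evaluate`, `majority_baseline`) and the majority-class stand-in for a traditional ML baseline are assumptions introduced here for clarity.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(train_labels: List[int]) -> Callable[[object], int]:
    """Hypothetical stand-in for a traditional ML model: always
    predicts the most frequent label seen in training."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

def evaluate(models: Dict[str, Callable[[object], int]],
             test_set: List[Tuple[object, int]]) -> Dict[str, float]:
    """Score every candidate model on the same test set, so that
    traditional ML, transformer, and multimodal entries are compared
    under identical conditions."""
    y_true = [y for _, y in test_set]
    results = {}
    for name, predict in models.items():
        y_pred = [predict(x) for x, _ in test_set]
        results[name] = accuracy(y_true, y_pred)
    return results
```

In the actual benchmark, each entry in `models` would wrap a trained classifier (e.g. an SVM, a fine-tuned BERT, or a VLM+LLM pipeline) behind the same predict interface, so the comparison stays uniform across paradigms.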
The findings not only quantify the performance gains of foundation models over traditional ML but also explore their applicability boundaries across different data modalities and task complexities, providing critical insights for the future development of medical AI models.
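For multiclass medical tasks with imbalanced labels, accuracy alone can be misleading, so a class-balanced metric is a natural companion. The abstract does not name its metrics; macro-averaged F1 is assumed here as a common choice, and the sketch below is an illustrative pure-Python implementation.

```python
from collections import defaultdict
from typing import List

def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 scores; rare and common classes
    contribute equally, which matters for imbalanced medical labels."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Reporting such a metric alongside accuracy helps expose cases where a foundation model's apparent gain comes mostly from the majority class.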