Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement
📄 Summary
Chart understanding is a quintessential information fusion task that requires the seamless integration of graphical and textual data to extract meaning. The emergence of Multimodal Large Language Models (MLLMs) has revolutionized this domain; however, the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap for this nascent frontier by structuring the core components of the domain. It begins by analyzing the fundamental challenges of fusing visual and linguistic information in charts. It then categorizes downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the expanding scope of the field. Finally, it presents a comprehensive evaluation of existing research, identifying future research directions and potential cognitive-enhancement methods.