📄 English Summary
ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
ChartDiff is the first large-scale benchmark for cross-chart comparative summarization, filling a gap left by existing chart-understanding benchmarks, which focus primarily on single-chart interpretation. It contains 8,541 chart pairs spanning diverse data sources, chart types, and visual styles. Each pair is annotated with an LLM-generated, human-verified summary describing differences in trends, fluctuations, and anomalies. The authors use ChartDiff to evaluate general-purpose, chart-specialized, and pipeline-based models. Frontier general-purpose models achieve the highest GPT-based quality scores; the performance of chart-specialized models is also assessed.