Do Children's Intelligence Tests Challenge Multimodal Large Language Models? KidGym: A 2D Grid-Based Reasoning Benchmark

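📄 Summary (translated from Chinese)

Multimodal large language models (MLLMs) combine the linguistic strengths of language models with the ability to process multimodal data, which lets them solve a broader range of visual tasks. Because MLLMs aim at more general, human-like abilities than language-only models, the study draws on the Wechsler Intelligence Scales, an established instrument that assesses children's intelligence through interpretable, testable abilities. KidGym is a comprehensive 2D grid-based benchmark designed to evaluate five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory, and Planning. It comprises 12 unique tasks, each targeting at least one core capability and designed specifically to assess the adaptability of MLLMs.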

📄 English Summary

Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Multimodal Large Language Models (MLLMs) integrate the linguistic strengths of LLMs with the capability to process multimodal data, enabling them to tackle a wider array of visual tasks. Inspired by the Wechsler Intelligence Scales, which evaluate children's intelligence by breaking it down into interpretable and testable abilities, this study introduces KidGym, a comprehensive 2D grid-based benchmark designed to assess five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark consists of 12 unique tasks, each targeting at least one core capability, specifically crafted to gauge the adaptability of MLLMs.
