Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark

Source: Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Published: March 24, 2026

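📄 Summary (translated from Chinese)

Multimodal Large Language Models (MLLMs) combine the linguistic strengths of language models with the ability to process multimodal data, enabling them to solve a wider range of visual tasks. Because MLLMs aim for more general, human-like competence than language-only models, the study draws on the Wechsler Intelligence Scales, an established instrument that assesses children's intelligence through interpretable and testable abilities. KidGym is a comprehensive 2D grid-based benchmark designed to evaluate five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability and designed specifically to assess the adaptability of MLLMs.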

🏷️ Related Tags

#MultimodalLargeLanguageModels #ChildrenIntelligenceTests #ReasoningBenchmark #CapabilityEvaluation #KidGym

📄 English Summary

Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

Multimodal Large Language Models (MLLMs) integrate the linguistic strengths of LLMs with the capability to process multimodal data, enabling them to tackle a wider array of visual tasks. Inspired by the Wechsler Intelligence Scales, which evaluate children's intelligence by breaking it down into interpretable and testable abilities, this study introduces KidGym, a comprehensive 2D grid-based benchmark designed to assess five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark consists of 12 unique tasks, each targeting at least one core capability, specifically crafted to gauge the adaptability of MLLMs.
