📄 Chinese Summary (translated)
Traditional evaluations of language model reasoning ability rely primarily on adult-centric benchmarks, which presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched with baby language models trained on developmentally plausible input, such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities, if any, emerge under those constraints. To address this, BabyReasoningBench introduces a new suite of benchmarks designed to evaluate the reasoning abilities of baby language models from a developmental perspective. The tasks draw inspiration from research on child cognitive development and focus on assessing fundamental reasoning skills under conditions of limited knowledge and simplified language.
📄 English Summary
BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models
Traditional evaluations of language model reasoning capabilities are predominantly based on adult-centric benchmarks, which presuppose extensive world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are misaligned with baby language models, which are typically trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure the specific reasoning abilities, if any, that emerge under these developmentally constrained conditions.

To address this gap, BabyReasoningBench introduces a suite of benchmarks designed to evaluate the reasoning capabilities of baby language models from a developmental perspective. The tasks draw inspiration from research in child cognitive development and assess fundamental reasoning skills within contexts characterized by limited knowledge and simplified language. Specifically, BabyReasoningBench covers a diverse range of areas, including object permanence, causality, spatial reasoning, basic numerical concepts, and simple social cognition. Each task is crafted to avoid reliance on adult-level knowledge and complex language comprehension, offering a more accurate reflection of the reasoning abilities attainable by baby language models that follow a learning trajectory analogous to that of human children.

Using these benchmarks, researchers can gain deeper insight into how training data and architectural choices influence the reasoning performance of language models during their early developmental stages. BabyReasoningBench further provides a framework for systematically probing reasoning abilities across developmental milestones, revealing both the limitations and the potential of current models in simulating human child cognitive development.
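To make the benchmark's shape concrete, here is a minimal sketch of what an evaluation item and scorer could look like. The field names, the forced-choice format, and the example prompt are illustrative assumptions, not the paper's actual schema or data.

```python
from dataclasses import dataclass

@dataclass
class ReasoningTask:
    """One benchmark item; all fields are illustrative, not the paper's schema."""
    category: str        # e.g. "object_permanence", "causality", "spatial"
    prompt: str          # phrased in simple, child-directed language
    options: list[str]   # forced-choice candidate answers
    answer: int          # index of the correct option

# A hypothetical object-permanence item in child-directed language.
task = ReasoningTask(
    category="object_permanence",
    prompt="Mia puts the ball in the box and closes the lid. Where is the ball?",
    options=["in the box", "gone"],
    answer=0,
)

def score(tasks, predict):
    """Fraction of items where the model's chosen option index is correct."""
    correct = sum(1 for t in tasks if predict(t) == t.answer)
    return correct / len(tasks)

# A trivial baseline that always picks the first option.
print(score([task], lambda t: 0))  # → 1.0
```

In a real harness, `predict` would query a baby language model (e.g. by comparing option log-probabilities) rather than a fixed baseline; the forced-choice design keeps the linguistic demands low, in line with the benchmark's developmental constraints.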