ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models

Source: ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models

Published: March 23, 2026

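📄 Summary (translated from Chinese)

Large language models (LLMs), with their advanced cognitive capabilities, are increasingly serving as agents for a range of reasoning and planning tasks. Traditional evaluations typically focus on specific reasoning or planning problems in controlled environments. Recent work has explored travel planning as a way to integrate diverse verbal reasoning tasks into a real-world setting. Reasoning, however, is not limited to verbal reasoning, and evaluating LLMs comprehensively requires a testbed that covers tasks from multiple cognitive domains. ItinBench was created for this purpose: it introduces a spatial reasoning task, route optimization, into trip itinerary planning while retaining the traditional verbal reasoning tasks.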

🏷️ Related Tags

#LargeLanguageModels #ReasoningTasks #PlanningBenchmark #SpatialReasoning #TravelPlanning

📄 English Summary

ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models

Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real-world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, ItinBench is introduced, featuring a task of spatial reasoning, specifically route optimization, integrated into trip itinerary planning while maintaining traditional verbal reasoning tasks.


