ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
📄 Summary
Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for a range of reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions in controlled environments. Recent studies have explored travel planning as a way to integrate multiple verbal reasoning tasks into real-world contexts. However, reasoning extends beyond verbal reasoning, and a comprehensive evaluation of LLMs requires a testbed that covers tasks from multiple cognitive domains. To address this gap, ItinBench adds a spatial reasoning task, route optimization, to trip itinerary planning while retaining the traditional verbal reasoning tasks.
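To make "route optimization" concrete, here is a minimal sketch of how the visiting order in an itinerary could be compared against the shortest possible order. It is an illustration only, not ItinBench's actual scoring method: the stop names, the planar coordinates, the straight-line distance, and the helper names (`route_length`, `shortest_route`, `detour_ratio`) are all assumptions made for this example.

```python
from itertools import permutations
from math import dist

# Hypothetical stops with made-up planar coordinates (not taken from ItinBench).
STOPS = {
    "hotel": (0.0, 0.0),
    "museum": (0.0, 3.0),
    "park": (4.0, 3.0),
    "market": (4.0, 0.0),
}


def route_length(order, coords):
    """Total straight-line length of visiting the stops in `order`."""
    return sum(dist(coords[a], coords[b]) for a, b in zip(order, order[1:]))


def shortest_route(start, coords):
    """Brute-force the shortest open route that begins at `start`
    and visits every other stop exactly once."""
    others = [name for name in coords if name != start]
    candidates = ([start, *perm] for perm in permutations(others))
    return min(candidates, key=lambda order: route_length(order, coords))


def detour_ratio(order, coords):
    """Length of a proposed route divided by the shortest route
    from the same starting stop (1.0 means the order is optimal)."""
    best = shortest_route(order[0], coords)
    return route_length(order, coords) / route_length(best, coords)
```

With these assumed coordinates, a plan that visits hotel, park, museum, market covers 14 units, while the shortest order (hotel, museum, park, market) covers 10, giving a detour ratio of 1.4. Brute force is only practical for a handful of stops; a benchmark with longer itineraries would likely need a heuristic or a dedicated solver.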