DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

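📄 Chinese Summary

DeepSearchQA is a 900-prompt benchmark that evaluates agents on complex, multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that focus on single-answer retrieval or broad factuality checks, the DeepSearchQA dataset consists of challenging, carefully handcrafted tasks designed to assess an agent's ability to execute complex search plans and produce exhaustive answer lists. This design shift explicitly targets how agents perform on difficult queries that require deep understanding, integration of multiple sources, and critical analysis. Each prompt requires the agent not only to find information but also to synthesize, filter, and organize it to meet users' high demands for comprehensiveness and accuracy.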

📄 English Summary

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 fields. Conventional benchmarks focus on single-answer retrieval or broad-spectrum factuality. DeepSearchQA instead consists of difficult, handcrafted tasks that test whether an agent can execute a complex search plan and produce an exhaustive list of answers. The design targets queries that require deep comprehension, synthesis of information from multiple sources, and critical analysis. Each prompt requires the agent not only to locate information but also to synthesize, filter, and organize it so that the result is both comprehensive and accurate.

The benchmark covers disciplines including science, technology, medicine, history, and art, so that it tests both how well an agent generalizes and how deep its domain knowledge goes. By simulating the retrieval problems that researchers and analysts face in practice, DeepSearchQA aims to advance deep research agents that can handle open-domain, multi-hop, context-rich questions. Its emphasis on comprehensiveness means that finding some correct answers is not enough: the agent must find all relevant and significant answers and present them in a structured, easily digestible format. This gives the field a new evaluation standard and research direction for AI systems that explore information and discover knowledge more like a human researcher would.
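
The emphasis on exhaustive answer lists can be made concrete with a set-based score. The following is a minimal sketch and not DeepSearchQA's official metric, which this summary does not specify. It assumes answers are short strings compared by exact match after simple normalization; the function names `normalize` and `score_answer_list` are hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def score_answer_list(predicted, gold):
    """Compare a predicted answer list with a gold list as sets.

    Assumption: exact match after normalization. A real grader would
    likely need fuzzier matching (aliases, paraphrases).
    """
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    hits = len(pred & ref)

    # Precision penalizes spurious answers; recall penalizes missing ones.
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # True only when the agent returned every gold answer and nothing else.
        "exhaustive": pred == ref,
    }


if __name__ == "__main__":
    gold = ["Mercury", "Venus", "Earth", "Mars"]
    predicted = ["Earth", "mars ", "Pluto"]
    print(score_answer_list(predicted, gold))
```

In this toy example the agent finds two of the four gold answers and adds one incorrect answer, so recall (0.5) captures the comprehensiveness gap that a single-answer accuracy metric would miss.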
