HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

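📄 Chinese Summary (translated)

MCP servers contain thousands of open-source, standardized tools that connect large language models (LLMs) to external systems. However, existing datasets and benchmarks lack realistic human user queries, which leaves an important gap in evaluating tool usage on MCP servers and the surrounding ecosystem. Existing datasets do include tool descriptions, but they do not faithfully reflect how different users phrase their requests, which leads to poor generalization and to overstated reliability for some benchmarks. This research proposes the first large-scale MCP dataset of diverse, high-quality user queries, generated specifically for 2,800 tools across 308 MCP servers and built on the MCP Zero dataset. Each tool is paired with multiple user queries to increase the dataset's diversity and practical usefulness.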

📄 English Summary

HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

Model Context Protocol (MCP) servers host thousands of open-source, standardized tools that connect large language models (LLMs) to external systems. However, existing datasets and benchmarks lack realistic, human-like user queries, which leaves a significant gap in evaluating tool usage on MCP servers and the surrounding ecosystem. Current datasets do include tool descriptions, but they do not capture the diverse ways different users phrase their requests, which leads to poor generalization and to overstated reliability for some benchmarks. This research introduces the first large-scale MCP dataset of diverse, high-quality user queries, generated specifically for 2,800 tools across 308 MCP servers and built on the MCP Zero dataset. Each tool is paired with multiple user queries to increase the dataset's diversity and applicability.
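
As a rough illustration of how a dataset with several user queries per tool might be used to score tool retrieval, the sketch below computes recall@k with a simple bag-of-words retriever. Everything in it is an assumption made for illustration: the toy queries, the tool names and descriptions, and the helpers `retrieve` and `recall_at_k` are hypothetical and are not taken from HumanMCP or MCP Zero, and the summary does not say which retrievers or metrics the authors use.

```python
import math
from collections import Counter

# Hypothetical toy records: (user query, name of the tool it should retrieve).
# These are NOT drawn from the HumanMCP dataset.
QUERIES = [
    ("what's the weather like in Paris tomorrow", "get_forecast"),
    ("will it rain this weekend", "get_forecast"),
    ("send a message to the team channel", "post_message"),
    ("find files named report in my drive", "search_files"),
]

# Hypothetical tool descriptions keyed by tool name.
TOOLS = {
    "get_forecast": "Get the weather forecast including rain and temperature for a location",
    "post_message": "Post a message to a chat channel",
    "search_files": "Search files in a drive by name",
}


def tokenize(text):
    """Lowercase whitespace tokenization into a term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, tools, k):
    """Return the names of the k tools whose descriptions best match the query."""
    q = tokenize(query)
    ranked = sorted(
        tools, key=lambda name: cosine(q, tokenize(tools[name])), reverse=True
    )
    return ranked[:k]


def recall_at_k(queries, tools, k):
    """Fraction of queries whose gold tool appears among the top-k results."""
    hits = sum(1 for query, gold in queries if gold in retrieve(query, tools, k))
    return hits / len(queries)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"recall@{k} = {recall_at_k(QUERIES, TOOLS, k):.2f}")
```

Because each tool has more than one phrasing, a retriever that only matches wording close to the tool description would be expected to miss some queries at small k, which is the kind of gap the summary says description-only benchmarks can hide.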
