DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

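📄 Summary (translated from the Chinese)

Data science agents aim to accelerate discovery and insight generation by turning data into actionable analyses and findings. Existing data science benchmarks, however, have fragmented evaluation interfaces, narrow task coverage, and no rigorous data grounding, which makes cross-benchmark comparison difficult. In particular, DSGym shows that a considerable share of the tasks in current benchmarks can be solved without using the actual data. DSGym proposes a new paradigm to overcome these limitations, providing a unified and comprehensive environment for evaluation and training. Its core idea is to build a platform that simulates real data science workflows, on which agents carry out a series of tasks: data acquisition, cleaning, exploratory data analysis (EDA), feature engineering, model selection, training, evaluation, result interpretation, and report generation. By drawing on a variety of real-world datasets and task types, DSGym ensures that its evaluations are rigorous and generalize. The framework places particular emphasis on the principle of "data grounding": an agent's decisions and analyses must depend strictly on the data provided rather than on pattern matching or preset rules. To that end, DSGym designs evaluation metrics that measure not only the accuracy of the final result but also how well the agent understands and uses the data during processing and analysis. DSGym also provides a modular training environment that lets researchers and developers customize agent models and training strategies to their needs. The framework supports multiple programming languages and data science libraries, with the aim of encouraging community collaboration and continuous improvement of the benchmark. Through DSGym, the hope is to advance the field of data science agents so that they can cope more effectively with complex and changing real-world data challenges, and ultimately achieve data-driven automated scientific discovery.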

📄 English Summary

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

Data science agents hold significant promise for accelerating discovery and insight generation by transforming raw data into actionable analyses and findings. However, existing data science benchmarks suffer from fragmented evaluation interfaces, limited task coverage, and a lack of rigorous data grounding, all of which make cross-benchmark comparison difficult. Specifically, DSGym demonstrates that a substantial portion of the tasks in current benchmarks can be solved without using the actual data. DSGym proposes a new paradigm to overcome these limitations, offering a unified and comprehensive environment for evaluation and training. The core idea is a platform that simulates real-world data science workflows, on which agents perform the full range of tasks: data acquisition, cleaning, exploratory data analysis (EDA), feature engineering, model selection, training, evaluation, result interpretation, and report generation. By incorporating a diverse set of real-world datasets and task types, DSGym ensures that its evaluations are rigorous and generalize. The framework particularly emphasizes the principle of "data grounding": an agent's decisions and analyses must depend strictly on the provided data rather than on pattern matching or predefined rules. To achieve this, DSGym introduces evaluation metrics that assess not only the accuracy of the final outcome but also how well the agent understands and uses the data throughout the processing and analysis pipeline. Furthermore, DSGym provides a modular training environment that allows researchers and developers to customize agent models and training strategies to their specific requirements. The framework supports multiple programming languages and data science libraries, with the aim of fostering community collaboration and continuous benchmark improvement. Through DSGym, the objective is to advance the field of data science agents so that they can tackle complex and dynamic real-world data challenges more effectively, ultimately achieving data-driven automated scientific discovery.
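
The data-grounding concern can be made concrete with a toy check: score the same agent once with the data and once with the data withheld, then look at how many tasks are still solved. The sketch below is only an illustration of that idea, not DSGym's API or its actual metric. The names `Task`, `solve_rate`, `grounding_gap`, and `toy_agent`, as well as the exact-match scoring, are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical benchmark task (not a DSGym type)."""
    question: str
    data: dict   # toy stand-in for a dataset: column name -> list of values
    answer: str  # reference answer, compared by exact string match


# Assumed agent interface: gets the question and, optionally, the data.
Agent = Callable[[str, Optional[dict]], str]


def solve_rate(agent: Agent, tasks: Sequence[Task], with_data: bool) -> float:
    """Fraction of tasks answered correctly, with or without data access."""
    if not tasks:
        raise ValueError("need at least one task")
    correct = sum(
        agent(t.question, t.data if with_data else None) == t.answer
        for t in tasks
    )
    return correct / len(tasks)


def grounding_gap(agent: Agent, tasks: Sequence[Task]) -> float:
    """Accuracy with data minus accuracy without it.

    A gap near zero suggests the tasks can be solved without the data.
    """
    return solve_rate(agent, tasks, True) - solve_rate(agent, tasks, False)


def toy_agent(question: str, data: Optional[dict]) -> str:
    """Toy agent: computes max(x) when it has data, otherwise guesses "0"."""
    if data is None:
        return "0"
    return str(max(data["x"]))


TASKS = [
    Task("What is the maximum of x?", {"x": [1, 5, 3]}, "5"),
    Task("What is the maximum of x?", {"x": [0, -2, -1]}, "0"),  # guessable
    Task("What is the maximum of x?", {"x": [2, 9]}, "9"),
    Task("What is the maximum of x?", {"x": [4, 7]}, "7"),
]

if __name__ == "__main__":
    print("with data:   ", solve_rate(toy_agent, TASKS, with_data=True))
    print("without data:", solve_rate(toy_agent, TASKS, with_data=False))
    print("gap:         ", grounding_gap(toy_agent, TASKS))
```

In this toy set, one of the four tasks is answered correctly even when the data is withheld. Tasks like that are the ones a grounding-aware benchmark would want to flag or filter out.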
