ResearchGym: Evaluating Language Model Agents on Real-World AI Research

📄 Summary

ResearchGym is a benchmark and execution environment for evaluating AI agents on end-to-end research. It is built from five oral and spotlight papers from ICML, ICLR, and ACL: each paper's repository contributes its datasets, evaluation harnesses, and baseline implementations, while the paper's proposed method is withheld. The result is five containerized task environments comprising 39 sub-tasks in total. Within each environment, an agent must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's own metrics. A controlled evaluation of a GPT-5-powered agent reveals a clear gap between capability and reliability, with the agent achieving substantial improvements over the provided baselines.
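To make the scoring protocol concrete, here is a minimal sketch of how results across sub-tasks might be aggregated against the papers' baselines. This is an illustrative reconstruction, not ResearchGym's actual API: the `SubTaskResult` class and `summarize` function are hypothetical names, and the real harness may weight or report sub-tasks differently.

```python
from dataclasses import dataclass

@dataclass
class SubTaskResult:
    name: str
    baseline_score: float   # score of the human baseline from the paper's repository
    agent_score: float      # score achieved by the agent's proposed method
    higher_is_better: bool = True  # some metrics (e.g. error rates) are minimized

def summarize(results):
    """Return (win rate, mean relative improvement) of the agent vs. the baselines.

    A "win" is any sub-task where the agent strictly beats the human baseline
    in the metric's preferred direction.
    """
    wins = 0
    rel_improvements = []
    for r in results:
        delta = r.agent_score - r.baseline_score
        if not r.higher_is_better:
            delta = -delta  # flip sign so positive delta always means "better"
        if delta > 0:
            wins += 1
        rel_improvements.append(delta / abs(r.baseline_score))
    return wins / len(results), sum(rel_improvements) / len(rel_improvements)
```

A harness like this makes "improvement over the provided baselines" a single reportable number per environment, while still exposing per-sub-task wins for the reliability analysis.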