Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

📄 English Summary

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

This work presents a scalable pipeline for automatically generating high-quality training data for web agents. A major challenge in identifying high-quality training instances is trajectory evaluation, that is, quantifying how much progress a trajectory makes toward task completion. The authors introduce a constraint-based evaluation framework that assesses this progress at a fine-grained level. As a result, partially successful trajectories can be used for training, which significantly expands the amount of usable training data. The method is evaluated on BookingArena, a new benchmark of complex booking tasks across 20 popular websites, where the distilled student model outperforms open-source models.
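To make the idea of constraint-based, fine-grained scoring concrete, here is a minimal sketch. It is not the paper's implementation: the `Constraint` class, the `progress_score` function, the dictionary-based state, the example booking fields, and the 0.75 threshold are all hypothetical choices made for illustration. It assumes a task can be broken into independent checks and that progress is the fraction of checks a trajectory's final state satisfies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: the fields observed at the end of a trajectory.
State = Dict[str, str]


@dataclass
class Constraint:
    """One checkable requirement of a task (illustrative, not from the paper)."""
    name: str
    check: Callable[[State], bool]


def progress_score(constraints: List[Constraint], state: State) -> float:
    """Return the fraction of constraints that the observed state satisfies."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for c in constraints if c.check(state))
    return satisfied / len(constraints)


# Hypothetical booking task: a stay in Paris for 2 guests, confirmed.
constraints = [
    Constraint("city", lambda s: s.get("city") == "Paris"),
    Constraint("guests", lambda s: s.get("guests") == "2"),
    Constraint("check_in", lambda s: s.get("check_in") == "2025-06-01"),
    Constraint("confirmed", lambda s: s.get("confirmed") == "yes"),
]

partial = {"city": "Paris", "guests": "2", "check_in": "2025-06-01"}
trajectories = {
    "full": dict(partial, confirmed="yes"),  # satisfies all four constraints
    "partial": partial,                      # never reached confirmation
    "failed": {"city": "Rome"},              # wrong destination, nothing else set
}

scores = {name: progress_score(constraints, st) for name, st in trajectories.items()}

# Keep trajectories at or above an assumed threshold, so that partially
# successful runs can still contribute training data.
THRESHOLD = 0.75
kept = [name for name, score in scores.items() if score >= THRESHOLD]
print(scores)
print(kept)
```

A binary success check would keep only the `full` trajectory here. Scoring per constraint also retains `partial`, which is the kind of partially successful trajectory the summary describes as expanding the usable training data.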
