VeRA: Verified Reasoning Data Augmentation at Scale

📄 Summary

The main problem with most evaluation schemes today is their static nature: the same problems are reused repeatedly, leading to memorization, format exploitation, and eventual saturation. Measuring genuine AI progress requires evaluations that are robust by design rather than relying on post-hoc contamination detection. To this end, the VeRA (Verified Reasoning Data Augmentation) framework converts benchmark problems into executable specifications, each consisting of (i) a natural-language template with placeholder slots, (ii) a generator that samples valid configurations, and (iii) a deterministic verifier that validates the parameters and computes the correct answer for each configuration. From a single seed problem, VeRA can automatically generate many new evaluation problems, improving the diversity and validity of assessments.
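The template/generator/verifier triple can be illustrated with a minimal sketch. This is an assumption-laden toy, not VeRA's actual API: the names `generate`, `verify`, and `instantiate`, and the example seed problem, are all hypothetical.

```python
import random

# Hypothetical sketch of a VeRA-style executable specification built from one
# seed problem ("a train travels d km in t hours; find its average speed").
# All function names and the template are illustrative assumptions.

TEMPLATE = "A train travels {d} km in {t} hours. What is its average speed in km/h?"

def generate(rng: random.Random) -> dict:
    """Sample a valid parameter configuration for the template's slots."""
    t = rng.randint(2, 10)
    d = t * rng.randint(20, 120)  # construct d so the answer is an integer
    return {"d": d, "t": t}

def verify(params: dict) -> int:
    """Deterministically validate a configuration and compute its answer."""
    assert params["t"] > 0 and params["d"] > 0, "invalid configuration"
    assert params["d"] % params["t"] == 0, "answer must be an integer"
    return params["d"] // params["t"]

def instantiate(seed: int) -> tuple[str, int]:
    """Render one fresh evaluation item: (question text, gold answer)."""
    rng = random.Random(seed)
    params = generate(rng)
    return TEMPLATE.format(**params), verify(params)

# From a single seed problem, many distinct items can be produced:
for s in range(3):
    question, answer = instantiate(s)
    print(question, "->", answer)
```

Because the generator is seeded and the verifier is deterministic, the same seed always reproduces the same question/answer pair, while fresh seeds yield new surface forms of the underlying problem.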

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others