The Token Games: Evaluating Language Model Reasoning with Puzzle Duels


📄 English Summary

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Evaluating the reasoning capabilities of large language models has become increasingly challenging as these models improve. Human curation of difficult questions is costly, especially in recent benchmarks that rely on PhD-level domain knowledge to challenge the most capable models. Even so, it remains unclear whether such questions genuinely test reasoning or whether the models encountered similar problems during training. Inspired by 16th-century mathematical duels, The Token Games (TTG) is introduced: an evaluation framework in which models challenge each other with puzzles of their own creation. The framework leverages the format of programming puzzles (given a Python function that returns a boolean, find inputs that make it return True) to flexibly represent reasoning capabilities.
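To make the puzzle format concrete, here is a minimal sketch of what such a programming puzzle might look like. The specific function, its name, and the verification helper are illustrative assumptions, not taken from the paper; the point is only the shape of the task: a boolean-returning Python function whose satisfying input the solver must find.

```python
# Sketch of the programming-puzzle format described above.
# The puzzle itself is a hypothetical example, not from the paper.

def puzzle(x: int) -> bool:
    """A solver must find an input that makes this function return True."""
    # True exactly when x is a root of x^2 - 10x + 21, i.e. x = 3 or x = 7.
    return x * x - 10 * x + 21 == 0

def check(candidate: int) -> bool:
    """Verification is trivial: just run the puzzle on the proposed input."""
    return puzzle(candidate)

# Any satisfying input counts as a solution; x = 3 works here.
assert check(3)
```

Because verification only requires executing the function, puzzle difficulty can scale arbitrarily (from arithmetic to search and constraint problems) while correctness checking stays cheap and objective, which is what makes the format attractive for model-versus-model duels.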
