AI benchmarks are broken. Here’s what we need instead.

📄 Summary

For decades, artificial intelligence has been evaluated by asking whether machines can outperform humans. Across domains such as chess, advanced mathematics, programming, and essay writing, AI models and applications are scored against the performance of individual humans on the same task. The comparison is appealing because it yields a clear standard on an isolated problem. But this framing is also its limitation: it fails to capture the full potential and practical value of AI. Assessing AI capabilities properly calls for new benchmarks built around more complex tasks and real-world applications.
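To make concrete what "a clear standard on an isolated problem" means in practice, here is a minimal sketch of the exact-match scoring that many such benchmarks reduce to. Everything in it is hypothetical and for illustration only: the items, the `exact_match_accuracy` function, and the canned stand-in model are invented, not taken from the article.

```python
from typing import Callable, List, Tuple

# Hypothetical benchmark: isolated question/answer pairs, each with a
# single reference answer. This is the setting in which human-vs-machine
# comparison produces one clear, unambiguous score.
ITEMS: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("sqrt(144)", "12"),
]

def exact_match_accuracy(model: Callable[[str], str],
                         items: List[Tuple[str, str]]) -> float:
    """Fraction of items where the model's answer matches the reference.

    Easy to compute and to rank, but blind to multi-step, open-ended,
    real-world tasks that have no single reference answer.
    """
    correct = sum(model(question).strip() == answer
                  for question, answer in items)
    return correct / len(items)

if __name__ == "__main__":
    # A canned stand-in "model", just so the scorer runs end to end.
    canned = {"2 + 2": "4", "capital of France": "Paris", "sqrt(144)": "11"}
    print(exact_match_accuracy(lambda q: canned.get(q, ""), ITEMS))  # 0.666...
```

The score is unambiguous precisely because each item is isolated and has exactly one right answer, which is also why it says little about the messier, open-ended work the article argues new benchmarks should measure.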

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others