📄 Summary
A Theoretical Framework for Adaptive Utility-Weighted Benchmarking
Benchmarking has long been a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards provide a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, there is a growing need to complement these established practices with a more holistic conception of evaluation. Attending to the sociotechnical contexts in which these systems operate invites a deeper understanding of how multiple stakeholders, each with distinct priorities, shape what counts as meaningful or desirable model behavior. This research introduces adaptive utility-weighted benchmarking, a framework for adapting benchmark evaluation to the needs of diverse stakeholders.
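The summary stops at the abstract level and does not specify the framework's mechanics. As one hedged reading of "utility-weighted," the minimal Python sketch below (all stakeholder names, task names, and numbers are hypothetical, not from the paper) illustrates the core idea: the same per-task benchmark scores can yield different aggregate scores depending on whose utility weights are applied.

```python
from dataclasses import dataclass

@dataclass
class Stakeholder:
    """A stakeholder with relative utility weights over benchmark tasks."""
    name: str
    weights: dict[str, float]  # task name -> relative priority

def utility_weighted_score(scores: dict[str, float], stakeholder: Stakeholder) -> float:
    """Aggregate per-task scores under one stakeholder's (normalized) utility weights."""
    total = sum(stakeholder.weights.values())
    return sum(scores[task] * w / total for task, w in stakeholder.weights.items())

# Hypothetical example: one model, two stakeholders, two different verdicts.
scores = {"reasoning": 0.82, "safety": 0.64, "latency": 0.91}
clinician = Stakeholder("clinician", {"safety": 0.7, "reasoning": 0.3})
developer = Stakeholder("developer", {"latency": 0.6, "reasoning": 0.4})

print(f"clinician view: {utility_weighted_score(scores, clinician):.3f}")  # safety-dominated
print(f"developer view: {utility_weighted_score(scores, developer):.3f}")  # latency-dominated
```

Under this reading, "adaptive" would mean the weight vectors are elicited from, and revised with, the stakeholders themselves rather than fixed by the benchmark's authors; the sketch only shows the aggregation step.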