Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking

📄 Summary

The rapid proliferation of large language models (LLMs) in healthcare necessitates scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly and susceptible to data contamination, and they lack the calibrated measurement properties needed for fine-grained performance tracking. This study proposes and validates a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for the efficient assessment of standardized medical knowledge in LLMs. The research follows a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations, followed by an empirical evaluation of 38 LLMs on a human-calibrated medical item bank. Each model completed both the full item bank and the adaptive testing process.
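To make the adaptive procedure concrete, below is a minimal sketch of how an IRT-based CAT loop is commonly implemented: a two-parameter logistic (2PL) response model, maximum-information item selection, and a grid-based maximum-likelihood ability estimate with a standard-error stopping rule. The item parameters, the `answer_fn` interface, the 30-item cap, and the 0.3 standard-error threshold are illustrative assumptions, not the configuration reported in the study.

```python
# Minimal sketch of an IRT-based CAT loop (2PL model, maximum-information item
# selection, grid-based ML ability estimation). All parameters and names here
# are illustrative assumptions, not the study's actual implementation.
import numpy as np

def p_correct(theta, a, b):
    """2PL probability that an examinee with ability theta answers the item correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, a, b, grid=np.linspace(-4.0, 4.0, 161)):
    """Grid-based maximum-likelihood ability estimate from (item_index, score) pairs."""
    log_lik = np.zeros_like(grid)
    for item, score in responses:
        p = p_correct(grid, a[item], b[item])
        log_lik += np.log(p if score else 1.0 - p)
    return float(grid[np.argmax(log_lik)])

def run_cat(answer_fn, a, b, max_items=30, se_target=0.3):
    """Adaptively administer items until the ability estimate is precise enough."""
    responses, administered = [], set()
    theta = 0.0  # start from the prior mean
    for _ in range(max_items):
        # Pick the unadministered item with maximum Fisher information at the current theta.
        candidates = [i for i in range(len(a)) if i not in administered]
        item = max(candidates, key=lambda i: item_information(theta, a[i], b[i]))
        administered.add(item)
        responses.append((item, answer_fn(item)))  # answer_fn returns 1 if the model is correct
        theta = estimate_theta(responses, a, b)
        # Stop once the standard error (from total test information) drops below the target.
        info = sum(item_information(theta, a[i], b[i]) for i, _ in responses)
        if 1.0 / np.sqrt(info) < se_target:
            break
    return theta, len(responses)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(0.8, 2.0, size=200)  # hypothetical item discriminations
    b = rng.normal(0.0, 1.0, size=200)   # hypothetical item difficulties
    true_theta = 1.2                     # simulated ability of the "model" under test
    simulated_model = lambda i: int(rng.random() < p_correct(true_theta, a[i], b[i]))
    theta_hat, n_used = run_cat(simulated_model, a, b)
    print(f"estimated ability {theta_hat:.2f} after {n_used} items")
```

In an actual evaluation, the discrimination and difficulty parameters would come from the human-calibrated item bank, and `answer_fn` would submit the selected item to the LLM under test and score its answer.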
