LiveMedBench: A Contamination-Free Medical Benchmark for Automated Evaluation of Large Language Models
📄 Chinese Abstract (translated)
For the deployment of large language models (LLMs) in high-stakes clinical settings, we introduce LiveMedBench, a continuously updated, contamination-free medical benchmark. Existing medical benchmarks suffer from data contamination and temporal misalignment, leading to inaccurate performance assessments. Meanwhile, current evaluation metrics often rely on superficial lexical overlap or subjective scoring and cannot effectively verify clinical correctness. LiveMedBench addresses these problems with automated rubric-based scoring, ensuring reliable and valid evaluation and thereby advancing the use of LLMs in medicine.
📄 English Summary
LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
The deployment of Large Language Models (LLMs) in high-stakes clinical environments requires a robust and reliable evaluation framework. Existing medical benchmarks face significant challenges, including data contamination, where test items leak into model training data, and temporal misalignment, where static test sets fail to keep pace with rapid advances in medical knowledge. Additionally, current metrics for evaluating open-ended clinical reasoning often rely on superficial lexical overlap or subjective human scoring, neither of which reliably verifies clinical correctness. LiveMedBench addresses these issues by providing a continuously updated, contamination-free medical benchmark with automated rubric evaluation, improving the reliability and validity of LLM assessment in the medical domain.
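The summary contrasts superficial lexical-overlap metrics with automated rubric evaluation. The sketch below is hypothetical: the rubric items, answers, and keyword-matching checker are illustrative assumptions, not LiveMedBench's actual pipeline. It shows how a clinically incomplete answer can still score well on token-level F1, while a rubric of required criteria catches the omission.

```python
# Hypothetical contrast between a lexical-overlap metric and rubric-based
# scoring for an open-ended clinical answer. All rubric items and answers
# are illustrative; a real system might use an LLM judge per criterion.

def token_f1(prediction: str, reference: str) -> float:
    """Token-set F1: the kind of superficial lexical-overlap metric the
    summary argues cannot verify clinical correctness."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def rubric_score(prediction: str, rubric: list[dict]) -> float:
    """Award points for each required clinical criterion the answer covers.
    Keyword matching stands in for a more robust per-criterion checker."""
    text = prediction.lower()
    earned = sum(item["points"] for item in rubric if item["phrase"] in text)
    return earned / sum(item["points"] for item in rubric)

reference = "obtain blood cultures then start empiric intravenous antibiotics"
rubric = [
    {"phrase": "blood cultures", "points": 2},  # cultures drawn before antibiotics
    {"phrase": "antibiotics", "points": 2},     # empiric antimicrobial therapy
]

complete = "first obtain blood cultures and then start empiric intravenous antibiotics"
incomplete = "start empiric intravenous antibiotics immediately"  # omits cultures

print(f"complete:   F1={token_f1(complete, reference):.2f} "
      f"rubric={rubric_score(complete, rubric):.2f}")
print(f"incomplete: F1={token_f1(incomplete, reference):.2f} "
      f"rubric={rubric_score(incomplete, rubric):.2f}")
```

Under these toy inputs, the incomplete answer still earns a token F1 above 0.6 from shared vocabulary alone, while its rubric score drops to 0.5 because the "blood cultures" criterion is unmet, illustrating why per-criterion rubrics are a better proxy for clinical correctness than surface overlap.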
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.