MedArena: Comparing LLMs for Medicine with In-the-Wild Clinician Preferences

📄 Summary

Large language models (LLMs) are becoming increasingly integral to clinical workflows, encompassing clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely predominantly on static, templated benchmarks that fail to capture the complexity and dynamism of real-world clinical practice, producing a disconnect between benchmark performance and clinical utility. To address these limitations, MedArena is introduced: an interactive evaluation platform that lets clinicians directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the clinician to choose the preferred one. The platform thereby aims to ground model comparison in real clinical use and improve the applicability of LLMs in actual clinical settings.
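The pairwise protocol described above (sample two models, show both responses, record the clinician's preference) is the standard arena-style evaluation loop. The summary does not specify how MedArena aggregates votes into a ranking, so the sketch below pairs the sampling step with a conventional Elo update as one plausible choice; the model names and `K`-factor are illustrative placeholders, not details from the paper.

```python
import random

# Illustrative model pool; MedArena's actual model list is not given here.
MODELS = ["model_a", "model_b", "model_c", "model_d"]

def sample_pair(models):
    """Pick two distinct models uniformly at random for one comparison."""
    return random.sample(models, 2)

def elo_update(r_winner, r_loser, k=32):
    """Standard Elo update after a single pairwise preference vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    r_winner += k * (1.0 - expected_win)
    r_loser -= k * (1.0 - expected_win)
    return r_winner, r_loser

def record_vote(ratings, winner, loser):
    """Apply one clinician preference vote to the rating table."""
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    return ratings

ratings = {m: 1000.0 for m in MODELS}
a, b = sample_pair(MODELS)
# Suppose the clinician preferred model `a`'s response:
ratings = record_vote(ratings, winner=a, loser=b)
```

Because each vote compares only two models at a time, rankings remain meaningful even when clinicians submit arbitrary, non-templated queries, which is what distinguishes this setup from static benchmarks.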
