Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
📄 English Summary
Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
Doctorina MedBench is a comprehensive evaluation framework for agent-based medical AI that simulates realistic physician-patient interactions. Unlike traditional medical benchmarks, which rely on standardized test questions, it models a multi-step clinical dialogue in which a physician or an AI system must gather the medical history, analyze supplementary materials (including lab reports, images, and medical documents), formulate a differential diagnosis, and provide personalized recommendations. System performance is assessed with the D.O.T.S. metric, which has four components: Diagnosis, Observations/Investigations, Treatment, and Step Count. Together these allow a thorough evaluation of clinical communication effectiveness.
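To make the four D.O.T.S. components concrete, here is a minimal Python sketch of how one scored consultation might be recorded and how results could be summarized across cases. Everything in it is an assumption for illustration: the summary does not say how each component is scored or whether the components are combined, so the names `DotsResult` and `summarize`, the pass/fail encoding, and the per-component averaging are hypothetical and are not the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DotsResult:
    """One simulated consultation scored on the four D.O.T.S. components.

    Assumption: D, O and T are recorded as pass/fail judgments and S as an
    integer. The source only names the components, not their scales.
    """
    diagnosis: bool      # D: diagnosis judged correct
    observations: bool   # O: observations/investigations judged appropriate
    treatment: bool      # T: treatment judged appropriate
    step_count: int      # S: number of dialogue steps taken


def summarize(results: list[DotsResult]) -> dict[str, float]:
    """Average each component separately; no combined score is assumed."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    return {
        "diagnosis_rate": sum(r.diagnosis for r in results) / n,
        "observations_rate": sum(r.observations for r in results) / n,
        "treatment_rate": sum(r.treatment for r in results) / n,
        "mean_step_count": sum(r.step_count for r in results) / n,
    }


if __name__ == "__main__":
    # Two made-up consultations, used only to exercise the code.
    demo = [
        DotsResult(diagnosis=True, observations=True, treatment=False, step_count=10),
        DotsResult(diagnosis=False, observations=True, treatment=True, step_count=14),
    ]
    print(summarize(demo))
```

Keeping the four components separate rather than folding them into one number mirrors how the summary presents the metric, and it avoids inventing a weighting that the source does not describe.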