Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

Large language models have shown strong performance on general medical examinations; however, subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. This study evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. It operated under a closed-evidence constraint without external retrieval, while comparator LLMs had real-time web access to guidelines and primary literature.
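The evaluation described above boils down to having each system answer the same 120 board-style items and comparing accuracy against an answer key. A minimal grading harness for that protocol might look like the sketch below; the model names come from the study, but the data shapes, toy answers, and the `score_exam` function are illustrative assumptions, not the authors' actual pipeline.

```python
def score_exam(answers: dict[int, str], key: dict[int, str]) -> float:
    """Return the fraction of exam items answered correctly.

    `answers` maps question id -> chosen option (e.g. "A");
    `key` maps question id -> correct option.
    Unanswered items count as wrong.
    """
    if not key:
        raise ValueError("empty answer key")
    correct = sum(1 for qid, truth in key.items() if answers.get(qid) == truth)
    return correct / len(key)

# Toy 4-item "exam" for illustration only (not the real 120-item data):
key = {1: "A", 2: "C", 3: "B", 4: "D"}
runs = {
    "January Mirror": {1: "A", 2: "C", 3: "B", 4: "A"},
    "GPT-5": {1: "A", 2: "B", 3: "B", 4: "D"},
}
for model, model_answers in runs.items():
    print(f"{model}: {score_exam(model_answers, key):.2%}")
```

Under this setup the only asymmetry between systems is their evidence access: Mirror answers from its fixed curated corpus, while the comparator models may retrieve from the web before answering; the grading itself is identical for all of them.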
