Oxford's 32% Error Rate: How Safe Are Medical LLMs, Really?

📄 Summary

A study affiliated with Oxford University found that large language models produce clinically unsafe content or hallucinations in approximately 32% of medical summaries. This is not a trivial flaw: it means current systems are unsafe as autonomous clinical actors. For healthcare leaders, the key questions are how often LLMs fail, how they fail, and whether governance and technical controls can reduce the risk to an acceptable level. The study's framing is that a one-in-three chance of clinically problematic output rules out unsupervised bedside use, but the same error rate may be tolerable in tightly controlled assistive workflows where a clinician reviews every output.
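To make that distinction concrete, the sketch below turns the reported 32% rate into expected incident counts under different levels of human review. It is a hypothetical back-of-the-envelope model, not part of the study: the reviewer catch rates, the monthly summary volume, and the assumption that reviewers catch errors independently of which summaries are unsafe are all illustrative assumptions.

```python
# Hypothetical back-of-the-envelope model: how many unsafe LLM-generated
# summaries slip through review each month. Only the 0.32 base rate comes
# from the study discussed above; the catch rates and volume are assumed.

BASE_ERROR_RATE = 0.32        # study's reported rate of unsafe/hallucinated summaries
SUMMARIES_PER_MONTH = 10_000  # assumed volume for a mid-size health system

for catch_rate in (0.00, 0.80, 0.95, 0.99):  # 0.00 models unsupervised bedside use
    # Residual rate: unsafe outputs the reviewer fails to catch.
    residual_rate = BASE_ERROR_RATE * (1 - catch_rate)
    expected_incidents = residual_rate * SUMMARIES_PER_MONTH
    print(f"reviewer catch rate {catch_rate:4.0%}: "
          f"residual rate {residual_rate:6.2%}, "
          f"~{expected_incidents:,.0f} unsafe summaries/month slip through")
```

Even a 95% catch rate leaves roughly 160 problematic summaries per month at the assumed volume in this toy model, which is why the line between "assistive" and "autonomous" deployment carries so much weight.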


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.