📄 Abstract
Large language models are highly sensitive to prompt wording in clinical data abstraction tasks, yet most existing work treats prompts as fixed and studies uncertainty in isolation. This work argues that prompt sensitivity and uncertainty should be considered jointly. Experiments on two clinical tasks (MedAlign applicability/correctness and multiple sclerosis subtype abstraction) across several open-source and proprietary models quantify prompt sensitivity via flip rates and relate it to calibration and selective prediction. High prompt sensitivity is found to correlate with poorer calibration and with larger potential gains from selective prediction. Based on these observations, a stability-aware prompt optimization method is proposed that generates and evaluates multiple prompt variants to identify those with the least impact on model outputs.
📄 Extended Summary
Stability-Aware Prompt Optimization for Clinical Data Abstraction
Large language models applied to clinical data abstraction are highly sensitive to prompt wording, yet most existing research treats prompts as static and investigates uncertainty in isolation. This work argues for considering prompt sensitivity and uncertainty jointly. Across two clinical tasks (MedAlign applicability/correctness and multiple sclerosis subtype abstraction) and multiple open-source and proprietary models, prompt sensitivity is quantified through flip rates, the fraction of examples whose predicted label changes when the prompt is rephrased, and then correlated with model calibration and with the potential for performance improvement via selective prediction. The findings indicate that high prompt sensitivity frequently coincides with poorer calibration and with larger gains available through selective prediction. Building on these observations, a stability-aware prompt optimization method is proposed: multiple variants of each candidate prompt are generated and evaluated, and the prompts whose rephrasings least perturb model outputs are selected. This approach discovers more stable prompts, improving robustness across prompt formulations, and also improves calibration by identifying model uncertainty more accurately. The optimized prompts yield more consistent, reliable outputs under similar but differently worded instructions, clarify model limitations on specific clinical tasks, and support more robust deployment of language models in clinical applications. Experimental results show that the approach improves the accuracy and reliability of clinical data abstraction, offering a practical optimization pathway for applying large language models in the medical domain.
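The selection procedure described above can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: `flip_rate`, `select_stable_prompt`, and the toy `predict` function are hypothetical names, and the real method would call an actual LLM and likely combine stability with accuracy rather than minimizing flip rate alone.

```python
from typing import Callable, Dict, List, Sequence

def flip_rate(outputs_by_variant: Sequence[Sequence[str]]) -> float:
    """Fraction of examples whose predicted label differs across prompt variants."""
    n_examples = len(outputs_by_variant[0])
    flips = sum(
        1 for i in range(n_examples)
        if len({outputs[i] for outputs in outputs_by_variant}) > 1
    )
    return flips / n_examples

def select_stable_prompt(
    paraphrases: Dict[str, List[str]],   # candidate prompt -> its paraphrases
    predict: Callable[[str, str], str],  # (prompt, example) -> predicted label
    examples: Sequence[str],
) -> str:
    """Return the candidate whose paraphrase family yields the lowest flip rate."""
    scores = {}
    for candidate, alternatives in paraphrases.items():
        outputs = [
            [predict(prompt, x) for x in examples]
            for prompt in [candidate, *alternatives]
        ]
        scores[candidate] = flip_rate(outputs)
    return min(scores, key=scores.get)

# Toy stand-in for an LLM call: predictions flip whenever "v2" appears in the prompt.
toy_predict = lambda prompt, note: "pos" if "v2" in prompt else "neg"

best = select_stable_prompt(
    {
        "Classify the note.": ["Please classify the note."],  # stable family
        "Classify v2.": ["Classify the note."],               # unstable family
    },
    toy_predict,
    ["note1", "note2"],
)  # -> "Classify the note."
```

In practice `predict` would wrap a model API call, and the flip rate could be smoothed over sampled generations; the key design choice is that stability is scored per prompt family, so a prompt is preferred when its own rephrasings leave the model's outputs unchanged.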