📄 English Summary
Tokenization Tradeoffs in Structured EHR Foundation Models
Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization, the step that converts these timelines into discrete model inputs, determines what information is preserved, how efficiently it is encoded, and which relationships the model must learn rather than receive precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. The study pretrained transformer models on pediatric EHR data under a factorial design, varying tokenization along three axes: event encoding, time encoding, and workflow annotation. The tokenization strategies were then compared by area under the receiver operating characteristic curve (AUROC).
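To make the idea of a factorial tokenization design concrete, here is a minimal Python sketch. It is not the study's implementation: the `Event` fields, the gap-token buckets, the token names, and the two binary factors (`joint_events`, `time_tokens`) are all hypothetical stand-ins for the event-encoding and time-encoding axes, and the workflow-annotation axis is left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import product


@dataclass
class Event:
    time: datetime
    code: str   # hypothetical clinical code, e.g. a lab or medication
    value: str  # hypothetical binned value, e.g. "HIGH"


def gap_token(hours: float) -> str:
    """Map elapsed hours to a coarse time-gap token (assumed buckets)."""
    if hours < 1:
        return "<GAP_LT_1H>"
    if hours < 24:
        return "<GAP_LT_1D>"
    return "<GAP_GE_1D>"


def tokenize(events, joint_events: bool, time_tokens: bool) -> list:
    """Turn a patient timeline into a flat token list.

    joint_events: one token per (code, value) pair vs. two separate tokens.
    time_tokens:  insert an explicit gap token between consecutive events.
    """
    tokens = []
    previous = None
    for event in sorted(events, key=lambda e: e.time):
        if time_tokens and previous is not None:
            hours = (event.time - previous.time).total_seconds() / 3600
            tokens.append(gap_token(hours))
        if joint_events:
            tokens.append(f"{event.code}|{event.value}")
        else:
            tokens.extend([event.code, event.value])
        previous = event
    return tokens


# Toy timeline; the codes and timestamps are made up for illustration.
timeline = [
    Event(datetime(2024, 1, 1, 8, 0), "LAB_GLUCOSE", "HIGH"),
    Event(datetime(2024, 1, 1, 8, 30), "MED_INSULIN", "GIVEN"),
    Event(datetime(2024, 1, 3, 9, 0), "LAB_GLUCOSE", "NORMAL"),
]

# Full factorial grid over the two assumed binary factors (2 x 2 = 4 cells).
results = {
    (joint, timed): tokenize(timeline, joint, timed)
    for joint, timed in product([True, False], repeat=2)
}

for (joint, timed), tokens in results.items():
    print(f"joint={joint} time_tokens={timed} -> {len(tokens)} tokens")
```

Even in this toy setting the tradeoff is visible: the same three events become anywhere from 3 to 8 tokens depending on the cell of the grid, which changes both sequence length and what the model has to infer on its own.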