Tokenization Tradeoffs in Structured EHR Foundation Models

Source: Tokenization Tradeoffs in Structured EHR Foundation Models

Published: March 18, 2026

🏷️ Related Tags

#tokenization #electronic-health-records #foundation-models #clinical-events #pretraining

📄 Summary

Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization determines how these timelines are converted into discrete model inputs. It therefore governs which information is preserved, how efficiently it is encoded, and which relationships the model must learn rather than receive precomputed. Despite this, the effect of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. The study pretrains a transformer on pediatric EHR data under a factorial design that varies tokenization along three axes: event encoding, time encoding, and workflow annotation. Each tokenization strategy is assessed by the area under the receiver operating characteristic curve (AUROC).
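
To make the factorial idea concrete, here is a minimal Python sketch of how one patient timeline might be tokenized under different settings. Everything in it is an assumption for illustration: the summary does not specify the levels of each factor, so the "joint" versus "factored" event encodings, the "none" versus "interval" time encodings, the token formats, and the toy timeline are all hypothetical. Workflow annotation is left out because the summary does not define it.

```python
from itertools import product

# Hypothetical timeline: (day offset from first event, event code, value bucket or None).
timeline = [
    (0, "LAB/glucose", "high"),
    (0, "DX/E11.9", None),
    (3, "RX/metformin", None),
]

def encode_event(code, value, event_encoding):
    """Assumed event encodings: 'joint' fuses code and value into one token,
    'factored' emits the code and the value as separate tokens."""
    if value is None:
        return [code]
    if event_encoding == "joint":
        return [f"{code}={value}"]
    return [code, f"VAL/{value}"]

def tokenize(timeline, event_encoding="joint", time_encoding="none"):
    """Turn a timeline into a flat token list.
    Assumed time encodings: 'none' drops timing, 'interval' inserts a gap
    token whenever the day offset increases between consecutive events."""
    tokens = []
    prev_day = None
    for day, code, value in timeline:
        if time_encoding == "interval" and prev_day is not None and day > prev_day:
            tokens.append(f"GAP/{day - prev_day}d")
        tokens.extend(encode_event(code, value, event_encoding))
        prev_day = day
    return tokens

# Full factorial grid over the two illustrative factors (2 x 2 = 4 conditions).
grid = list(product(["joint", "factored"], ["none", "interval"]))
for event_encoding, time_encoding in grid:
    toks = tokenize(timeline, event_encoding, time_encoding)
    print(event_encoding, time_encoding, len(toks), toks)
```

Even on this toy input the tradeoff is visible: joint encoding yields shorter sequences but a larger vocabulary, while factored encoding and explicit gap tokens lengthen the sequence in exchange for reusable tokens and preserved timing.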
