AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

📄 Summary

AraModernBERT is an adaptation of the ModernBERT encoder architecture to Arabic, studying the effects of transtokenized embedding initialization and native long-context modeling of up to 8,192 tokens. Transtokenization proves crucial for Arabic language modeling, yielding significant improvements in masked language modeling performance over non-transtokenized initialization. AraModernBERT also delivers stable and effective long-context modeling, with improved intrinsic language modeling performance at extended sequence lengths. These advances open new possibilities for Arabic natural language processing.
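As a concrete illustration of what transtokenized initialization involves, here is a minimal sketch, assuming a heuristic in the trans-tokenization family: target-language tokens that already exist in the source vocabulary copy their embeddings directly, and the rest are re-encoded with the source tokenizer and initialized to the mean of their pieces. The function name `transtokenize_init` and the `src_encode` callback are hypothetical; the paper's exact alignment procedure may differ.

```python
import numpy as np

def transtokenize_init(src_emb, src_vocab, src_encode, tgt_vocab, seed=0):
    """Initialize target-language embeddings from a source model's embeddings.

    Hypothetical sketch: copy exact vocabulary matches, otherwise average the
    source embeddings of the token's pieces, and fall back to a small random
    vector when the source tokenizer cannot encode the token at all.
    """
    rng = np.random.default_rng(seed)
    dim = src_emb.shape[1]
    tgt_emb = rng.normal(0.0, 0.02, size=(len(tgt_vocab), dim))
    for token, tgt_id in tgt_vocab.items():
        if token in src_vocab:             # exact overlap: copy directly
            tgt_emb[tgt_id] = src_emb[src_vocab[token]]
        else:                              # decompose via the source tokenizer
            piece_ids = src_encode(token)  # e.g. encode without special tokens
            if piece_ids:
                tgt_emb[tgt_id] = src_emb[piece_ids].mean(axis=0)
    return tgt_emb
```

Averaging piece embeddings is a common prior for this kind of vocabulary transfer: it places each new Arabic token near the region of embedding space the source model already uses for the corresponding subword content, which is consistent with the reported gains over random (non-transtokenized) initialization.

The 8,192-token context, for its part, rests on ModernBERT's rotary position embeddings (RoPE) with alternating local and global attention. The angle computation below is a generic RoPE sketch; the 160,000 base is the global RoPE theta reported for ModernBERT and is treated here as an assumption.

```python
import numpy as np

def rope_angles(seq_len, head_dim, theta=160_000.0):
    """Rotary angles for positions 0..seq_len-1 (global-attention RoPE sketch).

    A large base theta slows the rotation of the low-frequency channel pairs,
    which is what lets one formula cover positions up to 8,192 and beyond.
    """
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,)
    return np.outer(np.arange(seq_len), inv_freq)                # (seq_len, head_dim/2)

angles = rope_angles(8192, head_dim=64)
cos, sin = np.cos(angles), np.sin(angles)  # rotate query/key channel pairs
```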
