AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

📄 Summary

AraModernBERT is an adaptation of the ModernBERT encoder architecture to Arabic, studying the effects of transtokenized embedding initialization and native long-context modeling of up to 8,192 tokens. Transtokenization proves crucial for Arabic language modeling, yielding significant improvements in masked language modeling performance over non-transtokenized initialization. AraModernBERT also delivers stable and effective long-context modeling, with improved intrinsic language modeling performance at extended sequence lengths. These advances open new possibilities for Arabic natural language processing.
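As a concrete illustration of what transtokenized initialization involves, here is a minimal sketch, assuming a heuristic in the trans-tokenization family: target-language tokens that already exist in the source vocabulary copy their embeddings directly, and the rest are re-encoded with the source tokenizer and initialized to the mean of their pieces. The function name `transtokenize_init` and the `src_encode` callback are hypothetical; the paper's exact alignment procedure may differ.

```python
import numpy as np

def transtokenize_init(src_emb, src_vocab, src_encode, tgt_vocab, seed=0):
    """Initialize target-language embeddings from a source model's embeddings.

    Hypothetical sketch: copy exact vocabulary matches, otherwise average the
    source embeddings of the token's pieces, and fall back to a small random
    vector when the source tokenizer cannot encode the token at all.
    """
    rng = np.random.default_rng(seed)
    dim = src_emb.shape[1]
    tgt_emb = rng.normal(0.0, 0.02, size=(len(tgt_vocab), dim))
    for token, tgt_id in tgt_vocab.items():
        if token in src_vocab:             # exact overlap: copy directly
            tgt_emb[tgt_id] = src_emb[src_vocab[token]]
        else:                              # decompose via the source tokenizer
            piece_ids = src_encode(token)  # e.g. encode without special tokens
            if piece_ids:
                tgt_emb[tgt_id] = src_emb[piece_ids].mean(axis=0)
    return tgt_emb
```

Averaging piece embeddings is a common prior for this kind of vocabulary transfer: it places each new Arabic token near the region of embedding space the source model already uses for the corresponding subword content, which is consistent with the reported gains over random (non-transtokenized) initialization.

The 8,192-token context, for its part, rests on ModernBERT's rotary position embeddings (RoPE) with alternating local and global attention. The angle computation below is a generic RoPE sketch; the 160,000 base is the global RoPE theta reported for ModernBERT and is treated here as an assumption.

```python
import numpy as np

def rope_angles(seq_len, head_dim, theta=160_000.0):
    """Rotary angles for positions 0..seq_len-1 (global-attention RoPE sketch).

    A large base theta slows the rotation of the low-frequency channel pairs,
    which is what lets one formula cover positions up to 8,192 and beyond.
    """
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,)
    return np.outer(np.arange(seq_len), inv_freq)                # (seq_len, head_dim/2)

angles = rope_angles(8192, head_dim=64)
cos, sin = np.cos(angles), np.sin(angles)  # rotate query/key channel pairs
```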
