Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
📄 Summary
This study investigates whether neural machine translation models learn language-universal conceptual representations or merely cluster languages by surface similarity. Six experiments probe the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer, bridging NLP interpretability with cognitive-science theories of multilingual lexical organization. Embedding the Swadesh core vocabulary list across 135 languages, the model's embedding distances correlate significantly with phylogenetic distances from the Automated Similarity Judgment Program ($\rho = 0.13$, $p = 0.020$), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages.
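The sketch below illustrates the kind of probing pipeline this describes: embed per-language Swadesh words with the NLLB encoder, build a language-by-language distance matrix, and correlate it with phylogenetic distances. It assumes the public facebook/nllb-200-distilled-600M checkpoint on HuggingFace; the three-language word fragment and the ASJP matrix are illustrative placeholders (the study covers 135 languages), and mean-pooled encoder states with a plain Spearman correlation over matrix upper triangles are plausible choices, not necessarily the paper's exact method (distance-matrix studies often use a Mantel permutation test instead).

```python
# Minimal sketch of the probing setup, under the assumptions stated above.
import numpy as np
import torch
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"  # smallest public NLLB-200 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL).eval()

# Hypothetical fragment of the Swadesh core vocabulary, keyed by the
# FLORES-200 language codes the NLLB tokenizer expects.
swadesh = {
    "eng_Latn": ["water", "fire", "sun", "moon", "hand"],
    "fra_Latn": ["eau", "feu", "soleil", "lune", "main"],
    "deu_Latn": ["Wasser", "Feuer", "Sonne", "Mond", "Hand"],
}

def language_vector(lang: str, words: list[str]) -> np.ndarray:
    """Mean-pooled encoder states over one language's Swadesh words."""
    tok.src_lang = lang  # sets the language prefix token for encoding
    batch = tok(words, return_tensors="pt", padding=True)
    with torch.no_grad():
        enc = model.get_encoder()(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (enc * mask).sum(1) / mask.sum(1)  # mean over non-pad tokens
    return pooled.mean(0).numpy()               # mean over the word list

langs = list(swadesh)
vecs = np.stack([language_vector(l, swadesh[l]) for l in langs])
emb_dist = squareform(pdist(vecs, metric="cosine"))  # model-side distances

# Placeholder phylogenetic distances; in practice these would come from
# the ASJP lexical-distance database for the same language sample.
asjp_dist = np.array([[0.00, 0.55, 0.45],
                      [0.55, 0.00, 0.60],
                      [0.45, 0.60, 0.00]])

# Correlate the upper triangles of the two distance matrices
# (the paper reports rho = 0.13, p = 0.020 over 135 languages).
iu = np.triu_indices(len(langs), k=1)
rho, p = spearmanr(emb_dist[iu], asjp_dist[iu])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```

With only three languages the correlation is meaningless; the point is the shape of the computation, which scales directly to the full 135-language sample.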