Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
📄 Chinese Summary
Low-resource languages (LRLs) often lack high-quality, large-scale datasets, which limits their use in tasks such as retrieval-augmented generation (RAG) and semantic search. This work challenges the common assumption that effective semantic alignment requires massive datasets or human-verified translations. Taking Armenian as a case study, it adopts a cost-effective adaptation strategy built on small-scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weight models. A comprehensive evaluation benchmark is established, combining existing datasets, translated data, and a manually curated dataset. Experimental results indicate that the approach is potentially effective for text-embedding tasks in low-resource languages.
📄 English Summary
Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Low-resource languages (LRLs) often suffer from a lack of high-quality, large-scale datasets, which hinders their application in tasks such as retrieval-augmented generation (RAG) and semantic search. This research challenges the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian, a language with a unique script, a cost-effective adaptation strategy is introduced that utilizes small-scale noisy synthetic data generated by translating English Reddit title-body pairs using open-weight models. A comprehensive evaluation benchmark is established, comprising existing datasets, translated data, and a manually curated dataset. Experimental results demonstrate the potential effectiveness of this approach in text embedding tasks for low-resource languages.
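The summary does not spell out the training recipe, so the following is only a sketch of one plausible realization: English Reddit title-body pairs are machine-translated into Armenian with an open-weight model, and the resulting noisy pairs are used for contrastive fine-tuning of a multilingual embedding model. The model names (facebook/nllb-200-distilled-600M, intfloat/multilingual-e5-base), the example pairs, and the training hyperparameters are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the described pipeline. Assumptions not stated in the summary:
# NLLB-200 stands in for the unspecified open-weight translation model, and
# multilingual-e5-base stands in for the embedding model being adapted.
from torch.utils.data import DataLoader
from transformers import pipeline
from sentence_transformers import SentenceTransformer, InputExample, losses

# 1) Build small-scale noisy synthetic data: machine-translate English Reddit
#    title-body pairs into Armenian (FLORES-200 code "hye_Armn").
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumption: any open-weight MT model
    src_lang="eng_Latn",
    tgt_lang="hye_Armn",
    max_length=512,
)

english_pairs = [
    ("How do I water a cactus?",
     "I bought a small cactus and I'm not sure how often to water it."),
    # ... a few thousand (title, body) pairs collected from Reddit
]

def translate(text: str) -> str:
    """Translate one English string into Armenian; MT noise is kept on purpose."""
    return translator(text)[0]["translation_text"]

armenian_pairs = [(translate(t), translate(b)) for t, b in english_pairs]

# 2) Adapt the embedding model with an in-batch contrastive objective: each
#    translated title is pulled toward its own body and pushed away from the
#    other bodies in the batch.
model = SentenceTransformer("intfloat/multilingual-e5-base")
train_examples = [InputExample(texts=[title, body]) for title, body in armenian_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("e5-base-armenian-adapted")
```

In the same spirit, the benchmark side (existing, translated, and manually curated test sets) could be approximated with sentence_transformers' InformationRetrievalEvaluator, though the paper's actual evaluation setup is not described in this summary.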