Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
📄 Chinese Summary
Low-resource languages (LRLs) often lack high-quality, large-scale datasets, which limits their use in tasks such as retrieval-augmented generation (RAG) and semantic search. This work challenges the common assumption that effective semantic alignment requires massive datasets or human-verified translations. Taking Armenian as a case study, it adopts a cost-effective adaptation strategy built on small-scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weight models. A comprehensive evaluation benchmark is established, combining existing datasets, translated data, and a manually curated dataset. Experimental results indicate that the approach is potentially effective for text-embedding tasks in low-resource languages.
📄 English Summary
Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Low-resource languages (LRLs) often suffer from a lack of high-quality, large-scale datasets, which hinders their application in tasks such as retrieval-augmented generation (RAG) and semantic search. This research challenges the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian, a language with a unique script, a cost-effective adaptation strategy is introduced that utilizes small-scale noisy synthetic data generated by translating English Reddit title-body pairs using open-weight models. A comprehensive evaluation benchmark is established, comprising existing datasets, translated data, and a manually curated dataset. Experimental results demonstrate the potential effectiveness of this approach in text embedding tasks for low-resource languages.
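The summary does not spell out the training recipe, so the following is only a sketch of one plausible realization: English Reddit title-body pairs are machine-translated into Armenian with an open-weight model, and the resulting noisy pairs are used for contrastive fine-tuning of a multilingual embedding model. The model names (facebook/nllb-200-distilled-600M, intfloat/multilingual-e5-base), the example pairs, and the training hyperparameters are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the described pipeline. Assumptions not stated in the summary:
# NLLB-200 stands in for the unspecified open-weight translation model, and
# multilingual-e5-base stands in for the embedding model being adapted.
from torch.utils.data import DataLoader
from transformers import pipeline
from sentence_transformers import SentenceTransformer, InputExample, losses

# 1) Build small-scale noisy synthetic data: machine-translate English Reddit
#    title-body pairs into Armenian (FLORES-200 code "hye_Armn").
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumption: any open-weight MT model
    src_lang="eng_Latn",
    tgt_lang="hye_Armn",
    max_length=512,
)

english_pairs = [
    ("How do I water a cactus?",
     "I bought a small cactus and I'm not sure how often to water it."),
    # ... a few thousand (title, body) pairs collected from Reddit
]

def translate(text: str) -> str:
    """Translate one English string into Armenian; MT noise is kept on purpose."""
    return translator(text)[0]["translation_text"]

armenian_pairs = [(translate(t), translate(b)) for t, b in english_pairs]

# 2) Adapt the embedding model with an in-batch contrastive objective: each
#    translated title is pulled toward its own body and pushed away from the
#    other bodies in the batch.
model = SentenceTransformer("intfloat/multilingual-e5-base")
train_examples = [InputExample(texts=[title, body]) for title, body in armenian_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("e5-base-armenian-adapted")
```

In the same spirit, the benchmark side (existing, translated, and manually curated test sets) could be approximated with sentence_transformers' InformationRetrievalEvaluator, though the paper's actual evaluation setup is not described in this summary.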