📄 Summary (translated from Chinese)
Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators often yields diminishing returns, with performance falling below that of RAG. To break through this RAG performance ceiling, Synthetic Mixed Training is proposed, which combines synthetic question-answer pairs with synthetic documents and exploits their complementary training signals. As the volume of synthetic data and the strength of the generator increase, the model achieves a 2.6% relative gain over RAG, notably on the long-document reading-comprehension benchmark QuALITY. In addition, a simple synthetic rewriting technique is introduced that further improves model performance.
📄 English Summary
Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
Synthetic data augmentation aids language models in acquiring new knowledge in data-constrained domains. However, simply scaling existing synthetic data methods by increasing the number of synthetic tokens or using stronger generators leads to diminishing returns, often resulting in performance below that of RAG. To overcome this limitation, Synthetic Mixed Training is introduced, which combines synthetic question-answer pairs with synthetic documents. This approach leverages their complementary training signals, yielding log-linear improvements as both the volume of synthetic data and the strength of the generator increase. As a result, the model achieves a 2.6% relative gain over RAG on the QuALITY benchmark for long-document reading comprehension. Additionally, a straightforward technique called Focal Rewriting is introduced to further enhance performance.
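The core idea of mixing the two synthetic data types into one training stream can be sketched as follows. This is a minimal illustration only: the function name, example formatting templates, and 1:1 mixing are assumptions, since the summary does not specify the paper's exact recipe.

```python
import random

def build_mixed_corpus(synthetic_docs, synthetic_qa_pairs, seed=0):
    """Interleave synthetic documents and synthetic QA pairs into a single
    shuffled list of training examples (hypothetical formatting)."""
    rng = random.Random(seed)
    examples = []
    # Synthetic documents provide broad, passage-level training signal.
    for doc in synthetic_docs:
        examples.append({"type": "document", "text": doc})
    # Synthetic QA pairs provide targeted, question-answering signal.
    for question, answer in synthetic_qa_pairs:
        examples.append({"type": "qa",
                         "text": f"Question: {question}\nAnswer: {answer}"})
    # Shuffling mixes both signals within each training epoch.
    rng.shuffle(examples)
    return examples

docs = ["A paraphrased passage restating facts from the source document."]
qas = [("Who commissioned the report?", "The oversight committee.")]
corpus = build_mixed_corpus(docs, qas)
```

The complementary-signal claim in the summary corresponds here to training on both example types in one corpus rather than on either type alone.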
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others