📄 Summary (translated from Chinese)
Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators often yields diminishing returns, with performance falling below that of RAG. To break through this RAG performance ceiling, Synthetic Mixed Training is proposed, which combines synthetic question-answer pairs with synthetic documents and exploits their complementary training signals. As the volume of synthetic data and the strength of the generator increase, the model achieves a 2.6% relative gain over RAG, notably on the long-document reading-comprehension benchmark QuALITY. In addition, a simple synthetic rewriting technique is introduced that further improves model performance.
📄 English Summary
Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
Synthetic data augmentation aids language models in acquiring new knowledge in data-constrained domains. However, simply scaling existing synthetic data methods by increasing the number of synthetic tokens or using stronger generators leads to diminishing returns, often resulting in performance below that of RAG. To overcome this limitation, Synthetic Mixed Training is introduced, which combines synthetic question-answer pairs with synthetic documents. This approach leverages their complementary training signals, yielding log-linear improvements as both the volume of synthetic data and the strength of the generator increase. As a result, the model achieves a 2.6% relative gain over RAG on the QuALITY benchmark for long-document reading comprehension. Additionally, a straightforward technique called Focal Rewriting is introduced to further enhance performance.
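The core idea of mixing the two synthetic data types into one training stream can be sketched as follows. This is a minimal illustration only: the function name, example formatting templates, and 1:1 mixing are assumptions, since the summary does not specify the paper's exact recipe.

```python
import random

def build_mixed_corpus(synthetic_docs, synthetic_qa_pairs, seed=0):
    """Interleave synthetic documents and synthetic QA pairs into a single
    shuffled list of training examples (hypothetical formatting)."""
    rng = random.Random(seed)
    examples = []
    # Synthetic documents provide broad, passage-level training signal.
    for doc in synthetic_docs:
        examples.append({"type": "document", "text": doc})
    # Synthetic QA pairs provide targeted, question-answering signal.
    for question, answer in synthetic_qa_pairs:
        examples.append({"type": "qa",
                         "text": f"Question: {question}\nAnswer: {answer}"})
    # Shuffling mixes both signals within each training epoch.
    rng.shuffle(examples)
    return examples

docs = ["A paraphrased passage restating facts from the source document."]
qas = [("Who commissioned the report?", "The oversight committee.")]
corpus = build_mixed_corpus(docs, qas)
```

The complementary-signal claim in the summary corresponds here to training on both example types in one corpus rather than on either type alone.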
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others