Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks

📄 Chinese Abstract

Synthetic Data Generation (SDG) using large language models (LLMs) has been widely recognized and applied to improve the performance of smaller, more resource- and compute-efficient LLMs. Ensuring the quality and diversity of the generated data is a key challenge for SDG. An analysis of the diversity and distribution of generated data in the embedding space reveals a strong correlation between the density of samples within a specific neighborhood and the prediction accuracy for samples in that region. Building on this insight, a targeted pipeline for embedding-based sampling is proposed that enhances data diversity and consistently improves model performance.

📄 English Summary

Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks

Synthetic Data Generation (SDG), leveraging Large Language Models (LLMs), has been widely recognized and adopted as an effective method to enhance the performance of smaller, resource-efficient LLMs through fine-tuning. A significant challenge in SDG is ensuring the quality and diversity of the generated data. An analysis of the diversity and distribution of generated data in the embedding space reveals a strong correlation between the density of examples within a specific neighborhood and the accuracy of predictions for examples drawn from that region. Building on this insight, a targeted pipeline for embedding-based sampling is presented, which enhances data diversity and consistently improves performance.
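The sampling idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes embeddings are already computed as a NumPy array, estimates local density with a simple k-nearest-neighbour heuristic, and samples preferentially from sparse (low-density) regions to increase diversity. The function names and parameters are hypothetical.

```python
import numpy as np

def knn_density(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Estimate local density of each point as the inverse of its
    mean distance to its k nearest neighbours."""
    # Pairwise Euclidean distances (n x n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Column 0 of each sorted row is the zero self-distance; take 1..k.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    return 1.0 / (knn.mean(axis=1) + 1e-8)

def density_weighted_sample(embeddings: np.ndarray, n_samples: int,
                            k: int = 5, rng=None) -> np.ndarray:
    """Draw indices with probability inversely proportional to local
    density, so sparse regions of embedding space are favoured."""
    rng = rng or np.random.default_rng(0)
    weights = 1.0 / knn_density(embeddings, k)
    probs = weights / weights.sum()
    return rng.choice(len(embeddings), size=n_samples,
                      replace=False, p=probs)

# Toy demo: one dense cluster plus a handful of scattered outliers.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.05, size=(50, 8))   # dense region
outliers = rng.normal(3.0, 1.0, size=(10, 8))   # sparse region
emb = np.vstack([cluster, outliers])
picked = density_weighted_sample(emb, n_samples=10, k=5, rng=rng)
# Sparse points (indices >= 50) should be strongly over-represented.
```

In a full pipeline, the inverse-density weights would drive which regions receive additional generated examples, rather than merely selecting from an existing pool.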

