📄 English Summary
Flatter Tokens are More Valuable for Speculative Draft Model Training
Speculative Decoding (SD) is a pivotal technique for accelerating Large Language Model (LLM) inference, and it typically requires training a draft model on extensive datasets. Examining this problem from a data-centric perspective reveals that not all training samples contribute equally to the SD acceptance rate. Both theoretical analysis and empirical validation show that tokens which induce flatter predictive distributions in the target model are significantly more valuable for raising the acceptance rate. Specifically, when the target model's predictive distribution is sharply peaked, a draft prediction that misses the single dominant token is very likely to be rejected, so the acceptance rate is low. Conversely, when the target distribution is more diffuse (i.e., flatter), a draft prediction that is not perfectly accurate still has a high probability of landing in the target model's high-probability region, which boosts the acceptance rate. This "flatness" reflects greater uncertainty in the target model — multiple plausible next tokens — and thus gives the draft model a larger tolerance for error. Identifying these flat-distribution tokens and prioritizing them during training substantially improves the efficiency and performance of the draft model, and in turn the overall speed of speculative decoding. The method amounts to a data-selection strategy that optimizes the draft-model training process, reduces reliance on massive datasets, and raises the utilization efficiency of each training sample.
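The intuition above can be made concrete with the standard speculative-sampling identity: the expected probability that a drafted token is accepted equals the overlap Σ_x min(p(x), q(x)) between the target distribution p and the draft distribution q. The sketch below uses illustrative four-token distributions (not from the paper) to show that, for the same imperfect draft, a flatter target yields a much higher acceptance rate:

```python
def acceptance_rate(p, q):
    """Expected token-level acceptance probability in speculative
    sampling: the overlap sum_x min(p(x), q(x)) between the target
    distribution p and the draft distribution q."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# An imperfect draft model, roughly uniform over four tokens.
q = [0.25, 0.25, 0.25, 0.25]

# Sharply peaked target: draft mass outside the peak is wasted.
p_sharp = [0.97, 0.01, 0.01, 0.01]
# Flat target: the draft's mass overlaps the target almost everywhere.
p_flat = [0.30, 0.25, 0.25, 0.20]

print(acceptance_rate(p_sharp, q))  # ~0.28
print(acceptance_rate(p_flat, q))   # ~0.95
```

The numbers are toy values, but the ordering is general: flattening p while holding the draft error fixed can only increase the overlap term.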
Experimental results indicate that, compared to conventional approaches, training with tokens exhibiting flat predictive distributions can significantly reduce the required training data volume and computational resources while maintaining or even improving the acceptance rate, offering a novel optimization direction for LLM inference acceleration.
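One natural way to realize this data-selection strategy is to score each training position by the Shannon entropy of the target model's predictive distribution (higher entropy = flatter) and keep only the flattest fraction for draft-model training. The sketch below is a minimal illustration of that idea; the `target_probs` values and the 50% keep ratio are assumptions for the example, not the paper's exact procedure:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_flat_tokens(target_probs, keep_ratio=0.5):
    """Return indices of the flattest `keep_ratio` fraction of token
    positions, ranked by entropy of the target distribution."""
    scores = [entropy(p) for p in target_probs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(order) * keep_ratio))
    return sorted(order[:k])

# Four token positions: two peaked and two flat target distributions.
target_probs = [
    [0.98, 0.01, 0.01],  # peaked -> low entropy, dropped
    [0.34, 0.33, 0.33],  # flat   -> high entropy, kept
    [0.90, 0.05, 0.05],  # peaked -> dropped
    [0.40, 0.35, 0.25],  # flat   -> kept
]
print(select_flat_tokens(target_probs))  # [1, 3]
```

In practice the scores would come from a single forward pass of the target model over the corpus, so the selection step adds little cost relative to draft-model training itself.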