Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

📄 English Summary

Speculative decoding is a technique that accelerates inference by using multiple language models: a small draft model proposes tokens that a larger target model then verifies in parallel. Previous work has optimized the throughput of such inference pipelines empirically, which is costly because it requires training large language models (LLMs). This study proposes a theory that analytically links the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream speculative-decoding inference system. The theory makes it possible to predict throughput-optimal hyperparameters for the components of an inference system before pre-training them, thereby simplifying the optimization process.
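For context on the draft-then-verify loop the summary refers to, here is a minimal toy sketch of standard speculative sampling. This is not code from the paper: the three-token distributions are made up and stand in for a real draft and target model; only the accept/reject logic follows the standard algorithm.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

def draft_probs(context):
    # Toy stand-in for a small, fast draft model: a fixed distribution.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_probs(context):
    # Toy stand-in for the large target model whose output we must match.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def sample(probs):
    # Draw one token from a {token: probability} dict.
    r = random.random()
    acc = 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then accept/reject them against the target.

    A drafted token x is accepted with probability
    min(1, p_target(x) / p_draft(x)); on rejection, one token is resampled
    from the normalized residual max(0, p_target - p_draft), which keeps the
    overall output distribution identical to sampling from the target alone.
    """
    # Phase 1: the draft model proposes k tokens autoregressively.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # Phase 2: the target model verifies the proposals (in a real system,
    # all k positions are scored in a single parallel forward pass).
    accepted = []
    ctx = list(context)
    for tok in drafted:
        p_d = draft_probs(ctx)
        p_t = target_probs(ctx)
        if random.random() < min(1.0, p_t[tok] / p_d[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            residual = {t: max(0.0, p_t[t] - p_d[t]) for t in VOCAB}
            z = sum(residual.values())
            residual = {t: p / z for t, p in residual.items()}
            accepted.append(sample(residual))
            break
    return accepted
```

Each step thus emits between 1 and k tokens for a single verification pass of the large model, which is where the throughput gain comes from. (The full algorithm also samples one bonus token from the target when all k drafts are accepted; this sketch omits that detail.)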


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.