Compiling the Vision Encoder: Squeezing 3% More Throughput from Qwen3-VL on Hopper GPUs

📄 English Summary

When vision-language models run under vLLM, the framework compiles the LLM decoder, fusing operators and capturing CUDA graphs to maximize throughput. The Vision Transformer (ViT) encoder that processes the images, however, has always run in plain eager mode. Addressing this for Qwen3-VL yielded a 3.4% throughput increase on an NVIDIA H200 GPU, uncovered and fixed three previously unknown bugs along the way, and resulted in a single-flag change that any vLLM user can enable today. This engineering story covers why the encoder was never compiled in the first place, how compilation support was ported over from a sister model, the problems hit during implementation, and what profiler analysis revealed about the performance bottlenecks. Extending compilation to the vision encoder improved end-to-end serving efficiency and shows that careful optimization of individual model components pays off when deploying vision-language models.
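The "single-flag change" described above can be sketched as a small engine-construction helper. Everything here is illustrative: the flag name `compile_mm_encoder` matches the compilation option recent vLLM versions expose for multimodal encoders, but the exact flag and the model identifier are assumptions that should be checked against your vLLM release.

```python
# Sketch: opting the ViT encoder into vLLM's compilation path.
# By default vLLM compiles only the LLM decoder (operator fusion +
# CUDA-graph capture); the flag below asks it to compile the
# multimodal encoder as well. Flag and model names are assumptions.

def build_engine_kwargs(model: str, compile_encoder: bool = True) -> dict:
    """Assemble keyword arguments for constructing a vLLM engine."""
    compilation_config: dict = {}
    if compile_encoder:
        # Assumed flag: extend torch.compile to the vision encoder.
        compilation_config["compile_mm_encoder"] = True
    return {"model": model, "compilation_config": compilation_config}

kwargs = build_engine_kwargs("Qwen/Qwen3-VL")  # illustrative model id
# With vLLM installed:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

The helper only packages arguments; all compilation work happens inside vLLM once the engine is constructed, so the change really is a one-flag toggle from the user's perspective.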

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.