I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

📄 English Summary

Google released TurboQuant at ICLR 2026, a technique that compresses transformer KV caches to 4 bits per coordinate with no accuracy loss. The paper reports a 5-6x memory reduction on H100 GPUs, evaluated on text models such as Gemma and Mistral. The author set out to test what the paper did not: how the technique holds up on a vision-language model processing video, and on a consumer GPU. Within 72 hours of the paper's release, the author published turboquant-vllm to PyPI. A quick-start guide covers the basic installation and usage commands.
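To make "4 bits per coordinate" concrete, here is a minimal sketch of symmetric uniform quantization of one KV-cache row. This is an illustration of the general idea only, not TurboQuant's actual algorithm (the paper's method is not reproduced here); all names are hypothetical.

```python
# Simplified sketch: 4-bit per-coordinate quantization of a KV-cache row.
# NOT TurboQuant's algorithm -- plain symmetric uniform quantization,
# shown only to illustrate what "4 bits per coordinate" means.

def quantize_4bit(vec):
    """Map each float to one of 16 integer levels in [-8, 7], plus a scale."""
    scale = max(abs(x) for x in vec) / 7.0 or 1.0  # avoid divide-by-zero
    codes = [max(-8, min(7, round(x / scale))) for x in vec]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Reconstruct approximate floats from 4-bit codes."""
    return [c * scale for c in codes]

key_row = [0.12, -0.53, 0.98, -0.07]   # one row of a hypothetical KV cache
codes, scale = quantize_4bit(key_row)
recon = dequantize_4bit(codes, scale)
# Each code fits in 4 bits, so storage per coordinate drops from
# 16 or 32 bits (fp16/fp32) to 4 -- the source of the memory savings.
```

Real schemes like TurboQuant add machinery on top of this to keep accuracy loss at zero; the sketch only shows where the 4x-8x storage reduction per coordinate comes from.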

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others