Apple's "LLM in a Flash" Research: Running Qwen 397B Locally

📄 Summary

Dan Woods' research demonstrates running a custom build of Qwen3.5-397B-A17B at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, even though the model occupies 209GB on disk (120GB after quantization). Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model: each token is computed against only a subset of the model's weights. Those expert weights can therefore be streamed from SSD into memory on demand, so the full weight set never needs to reside in RAM at once. Dan applied techniques Apple introduced in its 2023 "LLM in a Flash" work to achieve this.
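To make the idea concrete, here is a minimal sketch of MoE expert streaming: expert weight matrices live in a memory-mapped file on disk, and only the experts a token's router selects are read into an LRU cache in RAM. All names (`ExpertStore`, the sizes, the top-2 routing) are illustrative assumptions, not Dan Woods' actual implementation or Apple's API.

```python
# Sketch of streaming MoE expert weights from disk with an LRU cache.
# Illustrative only: names and sizes are assumptions, not the real system.
import mmap
import tempfile
from collections import OrderedDict
import numpy as np

N_EXPERTS, D = 16, 64          # tiny stand-ins for the model's real sizes
EXPERT_BYTES = D * D * 4       # one float32 weight matrix per expert

class ExpertStore:
    """Memory-maps a file of expert weights; loads experts lazily, evicting LRU."""
    def __init__(self, path, cache_size=4):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.disk_reads = 0      # counts how often we actually touch the SSD

    def get(self, idx):
        if idx in self.cache:                # cache hit: no SSD traffic
            self.cache.move_to_end(idx)
            return self.cache[idx]
        self.disk_reads += 1                 # cache miss: stream from SSD
        off = idx * EXPERT_BYTES
        buf = self.mm[off:off + EXPERT_BYTES]
        w = np.frombuffer(buf, dtype=np.float32).reshape(D, D)
        self.cache[idx] = w
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used expert
        return w

# Write fake expert weights to disk, then run a few "tokens" through top-2 routing.
path = tempfile.mktemp()
rng = np.random.default_rng(0)
weights = rng.standard_normal((N_EXPERTS, D, D)).astype(np.float32)
with open(path, "wb") as f:
    f.write(weights.tobytes())

store = ExpertStore(path, cache_size=4)
x = rng.standard_normal(D).astype(np.float32)
for token_experts in [(0, 3), (3, 5), (0, 5)]:   # router picks 2 experts per token
    y = sum(store.get(e) @ x for e in token_experts)

print(store.disk_reads)  # only unique experts hit the SSD; repeats are cache hits
```

Because only the router-selected experts are ever resident, peak RAM scales with the cache size rather than the full 209GB weight file; the real work in "LLM in a Flash" lies in overlapping those SSD reads with computation.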
