Apple's "LLM in a Flash" Research: Running Qwen 397B Locally

📄 Summary

Dan Woods' research demonstrates running a custom build of Qwen3.5-397B-A17B at over 5.5 tokens per second on a 48GB MacBook Pro M3 Max, even though the model occupies 209GB on disk (120GB after quantization). Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model: each token is computed against only a subset of the model's weights. Those expert weights can therefore be streamed from SSD into memory on demand, so the full weight set never needs to reside in RAM at once. Dan applied techniques Apple introduced in its 2023 "LLM in a Flash" work to achieve this.
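To make the idea concrete, here is a minimal sketch of MoE expert streaming: expert weight matrices live in a memory-mapped file on disk, and only the experts a token's router selects are read into an LRU cache in RAM. All names (`ExpertStore`, the sizes, the top-2 routing) are illustrative assumptions, not Dan Woods' actual implementation or Apple's API.

```python
# Sketch of streaming MoE expert weights from disk with an LRU cache.
# Illustrative only: names and sizes are assumptions, not the real system.
import mmap
import tempfile
from collections import OrderedDict
import numpy as np

N_EXPERTS, D = 16, 64          # tiny stand-ins for the model's real sizes
EXPERT_BYTES = D * D * 4       # one float32 weight matrix per expert

class ExpertStore:
    """Memory-maps a file of expert weights; loads experts lazily, evicting LRU."""
    def __init__(self, path, cache_size=4):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.disk_reads = 0      # counts how often we actually touch the SSD

    def get(self, idx):
        if idx in self.cache:                # cache hit: no SSD traffic
            self.cache.move_to_end(idx)
            return self.cache[idx]
        self.disk_reads += 1                 # cache miss: stream from SSD
        off = idx * EXPERT_BYTES
        buf = self.mm[off:off + EXPERT_BYTES]
        w = np.frombuffer(buf, dtype=np.float32).reshape(D, D)
        self.cache[idx] = w
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used expert
        return w

# Write fake expert weights to disk, then run a few "tokens" through top-2 routing.
path = tempfile.mktemp()
rng = np.random.default_rng(0)
weights = rng.standard_normal((N_EXPERTS, D, D)).astype(np.float32)
with open(path, "wb") as f:
    f.write(weights.tobytes())

store = ExpertStore(path, cache_size=4)
x = rng.standard_normal(D).astype(np.float32)
for token_experts in [(0, 3), (3, 5), (0, 5)]:   # router picks 2 experts per token
    y = sum(store.get(e) @ x for e in token_experts)

print(store.disk_reads)  # only unique experts hit the SSD; repeats are cache hits
```

Because only the router-selected experts are ever resident, peak RAM scales with the cache size rather than the full 209GB weight file; the real work in "LLM in a Flash" lies in overlapping those SSD reads with computation.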
