Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan

📄 Summary

A collaborative effort between PyTorch and Nebius has enabled training of DeepSeek-V3 Mixture-of-Experts models (16B and 671B) on a 256-GPU NVIDIA B200 cluster using TorchTitan. Evaluating two orthogonal optimizations, MXFP8 low-precision training and DeepEP expert-parallel communication, yielded pre-training speedups of up to 41%. This work opens a practical path to faster, more resource-efficient training of large-scale MoE models.
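
To make the MXFP8 side concrete, below is a minimal numerical sketch of the microscaling FP8 format: per the OCP MX specification, blocks of 32 contiguous elements share one power-of-two (E8M0) scale, with the payload stored in FP8 E4M3. This simulates the quantize/dequantize round trip in plain PyTorch; it is illustrative only, assuming nothing beyond the format itself, and is not the TorchTitan/torchao code path or the B200 hardware kernel.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn
BLOCK = 32            # MX block size: 32 consecutive elements share one scale

def mxfp8_round_trip(x: torch.Tensor) -> torch.Tensor:
    """Simulate MXFP8 quantize->dequantize: per-block power-of-two (E8M0)
    scales with an FP8 E4M3 payload. Assumes x.numel() is divisible by 32."""
    shape = x.shape
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    # The shared scale is a pure power of two, chosen so the scaled block
    # fits within the E4M3 dynamic range.
    exp = torch.ceil(torch.log2(amax.clamp(min=2**-126) / FP8_E4M3_MAX))
    scale = torch.exp2(exp)
    quantized = (blocks / scale).to(torch.float8_e4m3fn)  # lossy cast to FP8
    return (quantized.to(torch.float32) * scale).reshape(shape)

x = torch.randn(4, 64)
err = (x - mxfp8_round_trip(x)).abs().max().item()
print(f"max abs round-trip error: {err:.4f}")
```

Because each 32-element block carries its own scale, an outlier in one block does not degrade the precision of neighboring blocks, which is what lets MXFP8 stay close to BF16 accuracy while halving the bytes fed to each GEMM.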
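
DeepEP, in turn, accelerates the dispatch/combine all-to-all that routes tokens to the ranks hosting their router-selected experts. The sketch below shows that communication pattern in plain torch.distributed (assuming an initialized process group); the function and variable names are illustrative rather than DeepEP's API, and DeepEP replaces this generic all-to-all with custom NVLink/RDMA kernels.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, dest_rank: torch.Tensor, world_size: int):
    """Illustrative expert-parallel dispatch: route each token (top-1 routing)
    to the rank hosting its selected expert via an uneven all-to-all.

    tokens:    [n, d] local token activations
    dest_rank: [n] destination rank per token, produced by the router
    """
    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    send_buf = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    # Exchange per-rank counts so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # Exchange the token payloads with uneven split sizes.
    recv_buf = send_buf.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    # The combine step after expert compute is the same exchange with the
    # split sizes swapped, followed by undoing `order` locally.
    return recv_buf, order
```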
