Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan

📄 Summary

A collaborative effort between PyTorch and Nebius has enabled training of DeepSeek-V3 Mixture-of-Experts models (16B and 671B) on a 256-GPU NVIDIA B200 cluster using TorchTitan. Evaluating two orthogonal optimizations, MXFP8 low-precision training and DeepEP expert-parallel communication, yielded pre-training speedups of up to 41%. This work opens a practical path to faster, more resource-efficient training of large-scale MoE models.
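
To make the MXFP8 side concrete, below is a minimal numerical sketch of the microscaling FP8 format: per the OCP MX specification, blocks of 32 contiguous elements share one power-of-two (E8M0) scale, with the payload stored in FP8 E4M3. This simulates the quantize/dequantize round trip in plain PyTorch; it is illustrative only, assuming nothing beyond the format itself, and is not the TorchTitan/torchao code path or the B200 hardware kernel.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn
BLOCK = 32            # MX block size: 32 consecutive elements share one scale

def mxfp8_round_trip(x: torch.Tensor) -> torch.Tensor:
    """Simulate MXFP8 quantize->dequantize: per-block power-of-two (E8M0)
    scales with an FP8 E4M3 payload. Assumes x.numel() is divisible by 32."""
    shape = x.shape
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    # The shared scale is a pure power of two, chosen so the scaled block
    # fits within the E4M3 dynamic range.
    exp = torch.ceil(torch.log2(amax.clamp(min=2**-126) / FP8_E4M3_MAX))
    scale = torch.exp2(exp)
    quantized = (blocks / scale).to(torch.float8_e4m3fn)  # lossy cast to FP8
    return (quantized.to(torch.float32) * scale).reshape(shape)

x = torch.randn(4, 64)
err = (x - mxfp8_round_trip(x)).abs().max().item()
print(f"max abs round-trip error: {err:.4f}")
```

Because each 32-element block carries its own scale, an outlier in one block does not degrade the precision of neighboring blocks, which is what lets MXFP8 stay close to BF16 accuracy while halving the bytes fed to each GEMM.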
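
DeepEP, in turn, accelerates the dispatch/combine all-to-all that routes tokens to the ranks hosting their router-selected experts. The sketch below shows that communication pattern in plain torch.distributed (assuming an initialized process group); the function and variable names are illustrative rather than DeepEP's API, and DeepEP replaces this generic all-to-all with custom NVLink/RDMA kernels.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, dest_rank: torch.Tensor, world_size: int):
    """Illustrative expert-parallel dispatch: route each token (top-1 routing)
    to the rank hosting its selected expert via an uneven all-to-all.

    tokens:    [n, d] local token activations
    dest_rank: [n] destination rank per token, produced by the router
    """
    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    send_buf = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    # Exchange per-rank counts so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # Exchange the token payloads with uneven split sizes.
    recv_buf = send_buf.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    # The combine step after expert compute is the same exchange with the
    # split sizes swapped, followed by undoing `order` locally.
    return recv_buf, order
```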
