A Step Toward Federated Pretraining of Multimodal Large Language Models

📄 Summary

The rapid evolution of Multimodal Large Language Models (MLLMs) is hindered by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain locked away in privacy-sensitive environments. Federated Learning (FL) offers a promising way to unlock these distributed resources, yet existing research focuses primarily on fine-tuning, leaving the foundational pre-training phase largely unexplored. This study introduces the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and the Large Language Model (LLM) while collaboratively training the cross-modal projector. Two critical challenges arise in this setting: interference among locally trained parameters during aggregation, and the effective utilization of distributed data resources. This work opens a new direction for the pre-training of multimodal models.
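
The training pattern described above can be made concrete with a short sketch: each client trains only the cross-modal projector on its private data while the vision encoder and LLM stay frozen, and a server merges the projector updates each round. Everything below is an illustrative assumption rather than the paper's implementation: the toy stand-in modules, the linear projector, and plain FedAvg aggregation (which does not itself address the parameter-interference challenge the authors identify).

```python
# Minimal sketch of the Fed-MA training pattern under illustrative
# assumptions: toy stand-in modules, a linear projector, and plain
# FedAvg aggregation (not the paper's actual aggregation scheme).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
IMG_DIM, VIS_DIM, LLM_DIM, VOCAB = 8, 32, 64, 10

# Frozen components: stand-ins for the vision encoder and the LLM.
vision_encoder = nn.Linear(IMG_DIM, VIS_DIM)
llm_head = nn.Linear(LLM_DIM, VOCAB)
for p in list(vision_encoder.parameters()) + list(llm_head.parameters()):
    p.requires_grad = False  # only the projector is trainable

def local_update(global_projector, local_data, epochs=1, lr=1e-2):
    """Client side: copy the global projector and train it on private data."""
    projector = copy.deepcopy(global_projector)
    optimizer = torch.optim.SGD(projector.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in local_data:
            features = vision_encoder(images)       # frozen visual features
            logits = llm_head(projector(features))  # gradients reach projector only
            loss = loss_fn(logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return projector.state_dict(), len(local_data)

def fedavg(states, sizes):
    """Server side: average projector weights, weighted by local batch count."""
    total = sum(sizes)
    merged = copy.deepcopy(states[0])
    for key in merged:
        merged[key] = sum(n * s[key] for n, s in zip(sizes, states)) / total
    return merged

# One communication round over three simulated clients with synthetic data.
global_projector = nn.Linear(VIS_DIM, LLM_DIM)
clients = [[(torch.randn(4, IMG_DIM), torch.randint(0, VOCAB, (4,)))
            for _ in range(5)] for _ in range(3)]
results = [local_update(global_projector, data) for data in clients]
states, sizes = zip(*results)
global_projector.load_state_dict(fedavg(list(states), list(sizes)))
```

Since only projector weights cross the network each round, communication cost scales with the projector rather than the full MLLM, which is presumably what makes the paradigm lightweight.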
