Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models

📄 Summary

Large Multimodal Models (LMMs) have achieved significant success in vision-language tasks, yet their extensive parameter counts are often underutilized during training and inference. This research proposes reusing model parameters through recursive refinement to extract stronger multimodal representations without increasing model size. The proposed RecursiveVLM is a recursive Transformer architecture designed for LMMs. Two key innovations enable effective recursion: first, a Recursive Connector aligns features across recursion steps by fusing intermediate-layer hidden states and applying modality-specific projections, respecting the distinct statistical structures of vision and language tokens; second, a novel mechanism enhances the model's expressive capability.
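The summary's core idea can be sketched in code: the same transformer weights are applied repeatedly, and between recursion steps a connector fuses intermediate-layer hidden states and routes vision and language tokens through separate projections. The paper's exact design is not given here, so this is a minimal numpy sketch under stated assumptions: `RecursiveConnector`, mean fusion over layers, and random matrices standing in for learned projections are all illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_proj(d_model):
    # Random linear map standing in for a learned projection (assumption).
    return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

class RecursiveConnector:
    """Hypothetical connector: fuse intermediate hidden states and apply
    modality-specific projections before the next recursion step."""

    def __init__(self, d_model):
        self.w_vision = make_proj(d_model)  # projection for vision tokens
        self.w_text = make_proj(d_model)    # projection for language tokens

    def __call__(self, hidden_states, vision_mask):
        # hidden_states: list of (seq_len, d_model) arrays from intermediate layers.
        # vision_mask: boolean (seq_len,) array, True where the token is visual.
        fused = np.mean(hidden_states, axis=0)  # simple mean fusion (assumption)
        out = np.empty_like(fused)
        out[vision_mask] = fused[vision_mask] @ self.w_vision
        out[~vision_mask] = fused[~vision_mask] @ self.w_text
        return out

def recursive_forward(x, vision_mask, transformer, connector, steps=2):
    """Reuse the same transformer weights across `steps` recursion passes,
    realigning features with the connector between passes."""
    for _ in range(steps):
        intermediates = transformer(x)  # list of per-layer hidden states
        x = connector(intermediates, vision_mask)
    return x
```

A toy run with a stand-in "transformer" (any callable returning a list of per-layer hidden states) shows the loop preserves the sequence shape while refining representations in place, which is what lets recursion depth grow without adding parameters.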

