📄 English Summary
Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection
This study reveals that transformer-based vision-language models (VLMs) exhibit significant depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, particularly in domains that require a tight coupling between perception and multi-step reasoning. Structured decoder-layer pruning is investigated through the lens of domain-aware activation similarity, which measures how strongly each layer transforms its representations on math versus non-math inputs. This yields three simple ranking criteria (math-aware, non-math-aware, and mixed) that identify the layers whose input-output activations change the least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, a consistent three-regime structure emerges.
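The criterion described above can be sketched in a few lines: for each decoder layer, compute the cosine similarity between the hidden states entering and leaving that layer on inputs from a given domain, then rank layers so that those changing their representations the least are pruned first. This is a minimal illustration, not the paper's implementation; the function names, the mixed criterion as a plain average of the two domain scores, and the use of raw hidden-state arrays (rather than a specific VLM's outputs) are all assumptions for the sketch.

```python
import numpy as np

def layer_change_scores(hidden_states):
    """hidden_states: list of (tokens, dim) arrays, the residual-stream
    activations at each layer boundary for one domain's inputs.
    Returns per-layer mean cosine similarity between a layer's input and
    output; high similarity means the layer transforms representations little."""
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        num = np.sum(h_in * h_out, axis=-1)
        den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + 1e-8
        scores.append(float(np.mean(num / den)))
    return scores

def prune_ranking(math_scores, nonmath_scores, mode="math"):
    """Rank layers for pruning under one of the three criteria.
    'mixed' is sketched here as a simple average of the two domains (assumption)."""
    if mode == "math":
        s = math_scores
    elif mode == "nonmath":
        s = nonmath_scores
    else:
        s = [(m + n) / 2 for m, n in zip(math_scores, nonmath_scores)]
    # Layers with the highest input-output similarity are pruned first.
    return sorted(range(len(s)), key=lambda i: s[i], reverse=True)
```

In practice the `hidden_states` list would come from a forward pass over a domain-specific calibration set (e.g. via a model's hidden-state outputs), averaged over examples before ranking.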