Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

📄 Abstract

Contrastive Language-Image Pre-training (CLIP) has become a key cornerstone of vision-language representation learning, supporting a wide range of downstream tasks and serving as the default visual backbone in multimodal large language models (MLLMs). However, CLIP's dense and opaque latent representations pose significant interpretability challenges. Interpretability and performance are commonly assumed to be in tension: enforcing sparsity usually degrades performance. This work breaks with that conventional wisdom, proposing a novel approach that co-optimizes interpretability and performance by introducing structured sparsity into CLIP's visual encoder. Specifically, a learnable sparsification mechanism dynamically identifies and prunes redundant connections or neurons during training, yielding sparser, more comprehensible visual features without significantly sacrificing model performance.

📄 English Summary

Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Contrastive Language-Image Pre-training (CLIP) has emerged as a cornerstone of vision-language representation learning, underpinning diverse downstream tasks and serving as the default visual backbone in multimodal large language models (MLLMs). Despite its pervasive success, CLIP's dense and opaque latent representations present significant interpretability challenges. A common assumption posits a tension between interpretability and performance: enforcing sparsity typically leads to performance degradation.

This work challenges that conventional wisdom with a novel approach that co-optimizes interpretability and performance in contrastive learning by integrating structured sparsity into CLIP's visual encoder. Specifically, a learnable sparsification mechanism dynamically identifies and prunes redundant connections or neurons during training, yielding sparser and more comprehensible visual features without significantly compromising model performance. The resulting sparsity not only reduces the model's computational complexity and memory footprint but also makes decision-making more transparent by lowering feature dimensionality and highlighting salient information. By analyzing these sparse visual features, researchers can better understand which regions and concepts the model attends to during image-text matching, improving model diagnostics and reliability.

Experimental results show that Sparse CLIP significantly enhances interpretability while matching, and in some cases surpassing, the performance of the original CLIP across various benchmark datasets, paving a new path toward more transparent and efficient vision-language models.
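The summary does not specify how the sparsification mechanism is implemented, so the details are the paper's own. As a rough, hedged illustration of the underlying idea of "pruning redundant connections", here is a minimal magnitude-based pruning sketch in NumPy; the function name, the threshold rule, and the static (non-learnable) mask are illustrative assumptions, not the paper's actual learnable gate:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Zero out the smallest-magnitude fraction of `weights`.

    Illustrative stand-in for a sparsification step: entries whose
    magnitude falls at or below the k-th-smallest value are treated
    as "redundant connections" and set to zero.
    """
    k = int(weights.size * sparsity)  # number of entries to prune
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep strictly larger entries
    return weights * mask, mask

# Example: prune 75% of a small random projection layer
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w_sparse, mask = magnitude_prune(w, 0.75)
print(f"kept {mask.mean():.0%} of weights")
```

In a trainable variant, the binary mask would instead come from learned per-connection gates (e.g. relaxed with a sigmoid and an L1/L0 penalty) so that pruning decisions are made jointly with the contrastive objective rather than by a fixed magnitude threshold.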


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others