📄 Chinese Abstract (translated)
Open-vocabulary semantic segmentation aims to assign a category to every pixel in an image according to textual labels. Existing methods typically leverage vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs are pre-trained on image-text pairs and are biased toward salient, object-centric regions, which exposes two key limitations when they are adapted to segmentation. First, foreground bias: VLMs tend to ignore background regions, yielding poor classification of background pixels, especially when the background is complex or semantically related to the foreground. Second, semantic confusion: VLMs struggle to distinguish semantically similar but spatially non-adjacent regions, a problem that is especially pronounced in the open-vocabulary setting, where the model must handle many unseen categories that may share similar visual features yet carry different semantics. To address these issues, the DiSa framework introduces a novel saliency-aware foreground-background disentanglement strategy.
📄 English Summary
DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation
Open-vocabulary semantic segmentation aims to assign a class label to every pixel in an image based on textual descriptions. Current methodologies typically leverage Vision-Language Models (VLMs), such as CLIP, for dense prediction tasks. However, VLMs, pre-trained on image-text pairs, inherently exhibit a bias towards salient, object-centric regions. This characteristic leads to two significant limitations when adapting VLMs for segmentation: (i) Foreground Bias, where the models tend to overlook or poorly represent background regions, resulting in suboptimal performance on background pixels, especially in scenarios with complex or semantically relevant background content. (ii) Semantic Conflation, where VLMs struggle to differentiate between semantically similar but spatially distinct regions. This issue is particularly pronounced in open-vocabulary settings, as the model must handle a vast array of unseen categories that may share visual features yet possess distinct semantic meanings. To address these challenges, the DiSa framework proposes a novel saliency-aware foreground-background disentangled strategy. DiSa first decomposes an image into foreground and background regions using a dedicated saliency module, effectively mitigating the inherent foreground bias of VLMs. Subsequently, specialized semantic segmentation heads are designed for the foreground and background regions to better capture their respective features and semantics. For foreground regions, DiSa capitalizes on the robust object recognition capabilities of VLMs, further sharpening fine-grained semantic discrimination through a contrastive learning mechanism. For background regions, DiSa employs an adaptive context-aware module that extracts pertinent background features from global image information and fuses them with the VLM's background representations, thereby improving background segmentation accuracy.
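The disentangled inference pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `disa_segment`, the saliency threshold `tau`, and the mean-pooled context vector standing in for the adaptive context-aware module are all illustrative assumptions; dense VLM patch features and class text embeddings are taken as given inputs.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """L2-normalize features so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def disa_segment(patch_feats, text_feats, saliency, tau=0.5, fuse=0.5):
    """Hypothetical sketch of saliency-aware fg/bg disentangled matching.

    patch_feats: (H, W, D) dense VLM image features
    text_feats:  (C, D) class text embeddings
    saliency:    (H, W) saliency scores in [0, 1]
    Returns an (H, W) map of predicted class indices.
    """
    # Step 1: saliency module output splits the image into fg/bg regions.
    fg_mask = saliency > tau

    # Step 2 (foreground path): rely on the VLM's object recognition and
    # match patch features to text embeddings directly.
    fg_logits = l2norm(patch_feats) @ l2norm(text_feats).T

    # Step 3 (background path): fuse each patch with a global context
    # vector before matching -- a crude stand-in for the paper's
    # adaptive context-aware module.
    ctx = patch_feats.mean(axis=(0, 1))
    bg_feats = fuse * patch_feats + (1.0 - fuse) * ctx
    bg_logits = l2norm(bg_feats) @ l2norm(text_feats).T

    # Step 4: route each pixel through its region's logits and classify.
    logits = np.where(fg_mask[..., None], fg_logits, bg_logits)
    return logits.argmax(axis=-1)
```

The contrastive learning used to sharpen foreground discrimination acts at training time and is omitted here; the sketch only shows how the two region-specific heads could be combined at inference.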