Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

📄 Abstract

Training Vision Transformers (ViTs) is challenging in settings where labeled data is scarce but unlabeled data is abundant. To address this, the Semi-Supervised Masked Autoencoder (SSMAE) framework is proposed. It jointly optimizes masked image reconstruction and classification by combining unlabeled and labeled samples and leveraging dynamically selected pseudo-labels. SSMAE's key innovation is a validation-driven gating mechanism that activates pseudo-label generation only after the model reaches a certain performance level. Specifically, early training focuses on self-supervised learning, acquiring intrinsic image representations through masked image reconstruction. As training progresses and the model demonstrates sufficient generalization on the validation set, the gate permits generating pseudo-labels from unlabeled data. Pseudo-label generation employs methods such as teacher-student models or consistency regularization to ensure label quality.
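The validation-driven gate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the accuracy threshold and the class name `PseudoLabelGate` are assumptions made here for clarity.

```python
class PseudoLabelGate:
    """Validation-driven gate: pseudo-labeling stays off until the model
    generalizes well enough on a held-out validation set.
    The 0.70 accuracy threshold is an illustrative assumption."""

    def __init__(self, acc_threshold=0.70):
        self.acc_threshold = acc_threshold
        self.active = False  # once opened, the gate stays open

    def update(self, val_accuracy):
        # Open the gate the first time validation accuracy clears the
        # threshold; later dips in accuracy do not close it again.
        if not self.active and val_accuracy >= self.acc_threshold:
            self.active = True
        return self.active


gate = PseudoLabelGate(acc_threshold=0.70)
for epoch, val_acc in enumerate([0.42, 0.61, 0.73, 0.70]):
    use_pseudo_labels = gate.update(val_acc)
    # epochs 0-1: reconstruction + supervised loss only
    # epochs 2-3: pseudo-labels on unlabeled data are enabled
```

Keeping the gate latched once opened avoids oscillating between regimes when validation accuracy fluctuates near the threshold.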

📄 English Summary

Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

Training Vision Transformers (ViTs) presents a significant challenge when labeled data is scarce but unlabeled data is abundant. To address this, a novel framework, Semi-Supervised Masked Autoencoder (SSMAE), is proposed. SSMAE jointly optimizes masked image reconstruction and classification tasks by leveraging both unlabeled and labeled samples, incorporating dynamically selected pseudo-labels. A key innovation of SSMAE is the introduction of a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves a certain level of performance. Specifically, during the initial training phase, the model primarily focuses on self-supervised learning, acquiring intrinsic image representations through masked image reconstruction. As training progresses and the model demonstrates sufficient generalization capability on the validation set, the gating mechanism enables the generation of pseudo-labels from unlabeled data. The pseudo-label generation process employs techniques such as teacher-student models or consistency regularization to ensure the quality of these pseudo-labels. These high-quality pseudo-labels are then utilized in the supervised learning phase, alongside the limited existing true labeled data, to collaboratively train the classifier. This strategy of progressively activating pseudo-labeling effectively mitigates performance degradation caused by inaccurate pseudo-labels in the early stages of training, while fully exploiting the vast amount of unlabeled data. Through this joint optimization and dynamic pseudo-labeling strategy, SSMAE significantly enhances the performance of ViTs under limited labeled data conditions, enabling them to learn richer visual features from unlabeled data. This leads to improved generalization ability and robustness across various downstream tasks.
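As a concrete illustration of the pseudo-label generation step, here is a minimal pure-Python sketch of confidence-based selection combined with an exponential-moving-average (EMA) teacher update, a common teacher-student scheme. The confidence threshold, the momentum value, and the function names are assumptions made for illustration; the paper's exact procedure may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def select_pseudo_labels(teacher_logits, confidence=0.9):
    """Keep only unlabeled samples whose teacher prediction is confident
    enough; returns (sample_index, predicted_class) pairs.
    The 0.9 confidence threshold is an illustrative assumption."""
    selected = []
    for i, logits in enumerate(teacher_logits):
        probs = softmax(logits)
        p = max(probs)
        if p >= confidence:
            selected.append((i, probs.index(p)))
    return selected


def ema_update(teacher_w, student_w, momentum=0.999):
    """Teacher weights track the student via an exponential moving
    average, so teacher predictions change slowly and stay stable."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_w, student_w)]


# Example: only the second sample's prediction is confident enough.
batch = [[1.0, 1.2, 0.9], [0.1, 6.0, 0.2]]
print(select_pseudo_labels(batch, confidence=0.9))  # → [(1, 1)]
```

Filtering by teacher confidence discards ambiguous predictions, which is one way to realize the "dynamically selected pseudo-labels" mentioned above: only samples the teacher is sure about ever contribute to the supervised loss.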
