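📄 Summary (translated from Chinese)
Open-vocabulary object detection (OVD) uses vision-language models to recognize novel categories zero-shot and performs strongly on natural images. Its transferability to aerial imagery, however, has not yet been explored. This study is the first to systematically evaluate five state-of-the-art open-vocabulary object detection models on the LAE-80C aerial dataset. LAE-80C contains 3,592 images and 80 categories, and the evaluation strictly follows zero-shot conditions. The experimental protocol is designed to isolate the models' understanding of semantic concepts rather than their ability to generalize to specific image features. An in-depth analysis of the models' performance reveals the challenges and opportunities that current open-vocabulary detectors face in the aerial imagery domain.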
📄 English Summary
Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation
Open-vocabulary object detection (OVD) leverages vision-language models for zero-shot recognition of novel categories and achieves strong performance on natural images. Its transferability to aerial imagery, however, remains largely unexplored. This study presents the first systematic benchmark of five state-of-the-art OVD models on the LAE-80C aerial dataset, which comprises 3,592 images and 80 categories, under strict zero-shot conditions. The experimental protocol is designed to isolate the models' understanding of semantic concepts from their ability to generalize to specific image features. An in-depth analysis of the results reveals the challenges and opportunities facing current open-vocabulary detectors in the aerial domain. Despite their strong performance on natural images, the models suffer a significant performance drop on aerial imagery, which the analysis attributes to differences in perspective, object scale, background complexity, and category distribution. In particular, the models struggle to detect small, densely packed, or heavily occluded aerial targets under zero-shot conditions. The evaluation also examines how different vision-language model architectures affect performance and identifies the design characteristics that improve detection accuracy and robustness in aerial scenes. The results provide a benchmark and a direction for developing open-vocabulary detectors tailored to aerial imagery, and they underscore the need for further research to bridge the domain gap between natural and aerial images.
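To make the idea of scoring a detector per category concrete, here is a minimal sketch of how one category's predictions could be scored against ground truth. The summary does not state which metric the benchmark uses, so everything here is an assumption for illustration: the uninterpolated average precision (AP) at an IoU threshold of 0.5, the greedy matching rule, the single-image scope, and the hypothetical helper names `iou` and `average_precision`.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[float, Box]],  # (confidence score, box) for one category
    gts: List[Box],                  # ground-truth boxes for the same category
    iou_thr: float = 0.5,            # assumed threshold, not taken from the source
) -> float:
    """Uninterpolated AP for one category within a single image."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    tp_flags = []
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if matched[j]:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched[best_j] = True
        tp_flags.append(best_j >= 0)
    # Sum the precision at every true-positive rank, then divide by the number
    # of ground-truth boxes so that missed objects lower the score.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank
    return ap / len(gts)
```

Averaging such per-category scores over all 80 categories would give one summary number per model. Standard benchmark tooling typically uses interpolated precision and several IoU thresholds instead, so this sketch should not be read as the paper's evaluation code.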