From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
📄 English Summary
From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification
Automated animal identification is a practical way to reunite lost pets with their owners, yet existing systems are often limited by small datasets and reliance on visual cues alone. This study proposes a multimodal verification framework that augments visual features with semantic identity priors derived from synthetic textual descriptions. To support the investigation, the authors constructed a training corpus of 1.9 million photographs covering 695,091 unique animals. Systematic ablation studies identified SigLIP2-Giant as the best-performing vision backbone and E5-Small-v2 as the best-performing text backbone. Fusion strategies ranging from simple concatenation to adaptive gating were also evaluated to determine the most effective method for feature integration.
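The summary names concatenation and adaptive gating as the two ends of the fusion spectrum but does not describe their architecture. The sketch below illustrates one common form of each, using only the Python standard library. It is an assumption-laden illustration, not the study's implementation: the function names, the toy dimensions, the randomly initialized weights, and the per-dimension sigmoid gate are all hypothetical, and the random vectors stand in for real SigLIP2-Giant and E5-Small-v2 embeddings.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Hypothetical random initialization; a real system would learn these weights."""
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def concat_fusion(visual, text):
    """Simple concatenation: normalize each modality, then join the vectors."""
    return l2_normalize(visual) + l2_normalize(text)


def gate_values(v_proj, t_proj, gate_layer):
    """Per-dimension gate in (0, 1), computed from both projected modalities."""
    weights, bias = gate_layer
    return [sigmoid(z) for z in linear(weights, bias, v_proj + t_proj)]


def gated_fusion(visual, text, params):
    """Adaptive gating (assumed form): g * visual + (1 - g) * text in a shared space."""
    v = linear(*params["visual_proj"], visual)
    t = linear(*params["text_proj"], text)
    g = gate_values(v, t, params["gate"])
    fused = [gi * vi + (1.0 - gi) * ti for gi, vi, ti in zip(g, v, t)]
    return l2_normalize(fused)


# Toy dimensions chosen for readability; real encoder outputs are much larger.
VIS_DIM, TXT_DIM, SHARED_DIM = 8, 4, 6
rng = random.Random(0)
params = {
    "visual_proj": init_linear(rng, SHARED_DIM, VIS_DIM),
    "text_proj": init_linear(rng, SHARED_DIM, TXT_DIM),
    "gate": init_linear(rng, SHARED_DIM, 2 * SHARED_DIM),
}

# Stand-ins for a photo embedding and a text-description embedding.
photo_a = [rng.gauss(0.0, 1.0) for _ in range(VIS_DIM)]
desc_a = [rng.gauss(0.0, 1.0) for _ in range(TXT_DIM)]

fused_concat = concat_fusion(photo_a, desc_a)
fused_gated = gated_fusion(photo_a, desc_a, params)
print(len(fused_concat), len(fused_gated))
```

Concatenation keeps both modalities intact and leaves the weighting to whatever layer consumes the joined vector, while the gate lets the model decide, per dimension and per example, how much to trust the photo versus the description. Either fused vector could then be compared between two animals with cosine similarity for verification.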