Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

📄 Summary
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage approaches, tightly couple IR with specific detectors and rely on coarse-grained vision-language model (VLM) features, limiting generalization to unseen interactions. A decoupled framework is proposed that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. A deterministic generation method is introduced to enhance the accuracy and flexibility of interaction recognition.
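The decoupled pipeline the summary describes (any open-vocabulary detector for localization, an MLLM for naming the interaction) can be sketched as below. This is a minimal illustration, not the paper's actual interface: every name (`Box`, `zero_shot_hoi`, `fake_mllm`, the union-box crop) is a hypothetical stand-in, and the real MLLM call would send an image crop plus a prompt rather than a toy function.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple

# NOTE: all names below are hypothetical illustrations, not the paper's API.

@dataclass
class Box:
    label: str                                   # open-vocabulary class name
    xyxy: Tuple[float, float, float, float]      # (x1, y1, x2, y2)

def union_box(a: Box, b: Box) -> Tuple[float, float, float, float]:
    """Union region covering a human-object pair (the crop shown to the MLLM)."""
    return (min(a.xyxy[0], b.xyxy[0]), min(a.xyxy[1], b.xyxy[1]),
            max(a.xyxy[2], b.xyxy[2]), max(a.xyxy[3], b.xyxy[3]))

def zero_shot_hoi(boxes: List[Box],
                  verbs: List[str],
                  mllm_classify: Callable[[Tuple, str, List[str]], str]
                  ) -> List[Tuple[str, str, str]]:
    """Decoupled HOI: pair every detected human with every object, then ask
    the pluggable MLLM to pick the interaction verb for each pair. Because the
    MLLM only sees the pair's region and candidate verbs, any object detector
    can supply the boxes (detector-agnostic IR)."""
    humans = [b for b in boxes if b.label == "person"]
    objects = [b for b in boxes if b.label != "person"]
    triplets = []
    for h, o in product(humans, objects):
        verb = mllm_classify(union_box(h, o), o.label, verbs)
        triplets.append(("person", verb, o.label))
    return triplets

# Toy stand-in for the MLLM call, so the sketch runs end to end.
def fake_mllm(region, obj_label, verbs):
    return "ride" if obj_label == "bicycle" else verbs[0]

boxes = [Box("person", (10, 10, 50, 90)), Box("bicycle", (30, 40, 80, 95))]
print(zero_shot_hoi(boxes, ["ride", "hold", "watch"], fake_mllm))
# → [('person', 'ride', 'bicycle')]
```

Swapping `fake_mllm` for a real multimodal model (and the box list for a real detector's output) changes neither function, which is the point of the decoupling.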