Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

📄 Summary
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage approaches, tightly couple IR with specific detectors and rely on coarse-grained vision-language model (VLM) features, limiting generalization to unseen interactions. A decoupled framework is proposed that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. A deterministic generation method is introduced to enhance the accuracy and flexibility of interaction recognition.
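The decoupled pipeline the summary describes (any open-vocabulary detector for localization, an MLLM for naming the interaction) can be sketched as below. This is a minimal illustration, not the paper's actual interface: every name (`Box`, `zero_shot_hoi`, `fake_mllm`, the union-box crop) is a hypothetical stand-in, and the real MLLM call would send an image crop plus a prompt rather than a toy function.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple

# NOTE: all names below are hypothetical illustrations, not the paper's API.

@dataclass
class Box:
    label: str                                   # open-vocabulary class name
    xyxy: Tuple[float, float, float, float]      # (x1, y1, x2, y2)

def union_box(a: Box, b: Box) -> Tuple[float, float, float, float]:
    """Union region covering a human-object pair (the crop shown to the MLLM)."""
    return (min(a.xyxy[0], b.xyxy[0]), min(a.xyxy[1], b.xyxy[1]),
            max(a.xyxy[2], b.xyxy[2]), max(a.xyxy[3], b.xyxy[3]))

def zero_shot_hoi(boxes: List[Box],
                  verbs: List[str],
                  mllm_classify: Callable[[Tuple, str, List[str]], str]
                  ) -> List[Tuple[str, str, str]]:
    """Decoupled HOI: pair every detected human with every object, then ask
    the pluggable MLLM to pick the interaction verb for each pair. Because the
    MLLM only sees the pair's region and candidate verbs, any object detector
    can supply the boxes (detector-agnostic IR)."""
    humans = [b for b in boxes if b.label == "person"]
    objects = [b for b in boxes if b.label != "person"]
    triplets = []
    for h, o in product(humans, objects):
        verb = mllm_classify(union_box(h, o), o.label, verbs)
        triplets.append(("person", verb, o.label))
    return triplets

# Toy stand-in for the MLLM call, so the sketch runs end to end.
def fake_mllm(region, obj_label, verbs):
    return "ride" if obj_label == "bicycle" else verbs[0]

boxes = [Box("person", (10, 10, 50, 90)), Box("bicycle", (30, 40, 80, 95))]
print(zero_shot_hoi(boxes, ["ride", "hold", "watch"], fake_mllm))
# → [('person', 'ride', 'bicycle')]
```

Swapping `fake_mllm` for a real multimodal model (and the box list for a real detector's output) changes neither function, which is the point of the decoupling.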