Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

📄 English Summary

Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Vision-Language Models (VLMs) demonstrate strong capabilities in 2D perception and semantic reasoning, yet exhibit a limited understanding of 3D spatial structure. This discrepancy is investigated using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. To probe it, VRRPI-Bench is introduced, a benchmark derived from unlabeled egocentric videos augmented with verbalized annotations of relative pose information. VRRPI-Bench provides complex real-world data to evaluate VLMs' proficiency in handling 3D spatial geometry, encompassing diverse scenarios and motion patterns, such as vehicle driving, robotic manipulation, and human activities, that pose significant challenges to model robustness and generalization.

Through VRRPI-Bench, it is observed that current mainstream VLMs, including CLIP, ALIGN, and BLIP, exhibit notable deficiencies in accurately estimating relative camera poses. Specifically, their performance degrades significantly when dealing with large rotations and depth variations, leading to substantial estimation errors. Analysis suggests that VLMs inherently lack an understanding of 3D geometric constraints; their pre-training primarily targets 2D image features and textual semantic associations, failing to adequately learn the underlying 3D depth, viewpoint, and motion information embedded in images. Furthermore, VLMs' pose-estimation accuracy is severely impacted by complex visual conditions such as occlusions, illumination changes, and textureless regions.

This research underscores the importance of developing novel VLM architectures and training paradigms that can better integrate 3D geometric knowledge and learn more robust camera motion representations from large-scale video data.
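For readers unfamiliar with the task, the RCPE quantities described above have a simple closed form: given the two cameras' extrinsics, the relative pose is the transform from one camera frame to the other, and rotation error is usually measured as the geodesic angle between the estimated and ground-truth rotations. The sketch below illustrates both, assuming 4×4 camera-to-world matrices; the function names and use of numpy are illustrative assumptions, not details from the benchmark.

```python
import numpy as np

def relative_pose(T_w_a, T_w_b):
    """Relative pose taking points in camera B's frame to camera A's frame.

    T_w_a, T_w_b: 4x4 camera-to-world extrinsics for the two views
    (illustrative convention; the benchmark's own convention may differ).
    Returns T_a_b = inv(T_w_a) @ T_w_b, whose 3x3 rotation block and
    translation column are the quantities an RCPE model must infer
    from the image pair.
    """
    return np.linalg.inv(T_w_a) @ T_w_b

def rotation_error_deg(R_est, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip guards against arccos domain errors from floating-point noise.
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Example: view B is view A rotated 30 degrees about the vertical axis
# and shifted 1 unit sideways.
theta = np.radians(30.0)
R_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
T_a = np.eye(4)
T_b = np.eye(4)
T_b[:3, :3] = R_z
T_b[:3, 3] = [1.0, 0.0, 0.0]

T_rel = relative_pose(T_a, T_b)
print(rotation_error_deg(T_rel[:3, :3], np.eye(3)))  # 30 degrees off identity
```

A model's answer (or a verbalized annotation such as "turn left about 30 degrees and move one step right") can be scored against exactly these two quantities, which is why large rotations and depth changes dominate the error analysis above.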
