Egocentric Bias in Vision-Language Models

Source: Egocentric Bias in Vision-Language Models

Published: February 19, 2026

📄 Summary

This study introduces FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models (VLMs). The task involves simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformations from the complexities of 3D scenes. Evaluation of 103 VLMs reveals a systematic egocentric bias: the majority perform below chance level, with approximately three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit—models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically in integrated tasks.
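
The summary does not include FlipSet's stimuli or code, but the transformation it describes is easy to make concrete. The sketch below is a minimal Python illustration, not the benchmark itself: the grid format and the glyph mapping are assumptions, chosen as examples of characters whose shapes map onto other characters under a half-turn.

```python
# Minimal sketch of the spatial transformation L2 VPT asks for:
# a 180-degree in-plane rotation of a 2D character grid.
# NOTE: illustration only; the glyph mapping is a hypothetical
# example, not FlipSet's actual character set.
ROTATED_GLYPH = {
    "b": "q", "q": "b",
    "d": "p", "p": "d",
    "u": "n", "n": "u",
    "6": "9", "9": "6",
}

def rotate_180(grid: list[str]) -> list[str]:
    """Return the grid as seen after a 180-degree in-plane rotation.

    Reversing the row order and each row's character order handles
    position; ROTATED_GLYPH approximates the per-glyph flip (characters
    without a rotated counterpart are kept unchanged).
    """
    return [
        "".join(ROTATED_GLYPH.get(ch, ch) for ch in reversed(row))
        for row in reversed(grid)
    ]

# The camera sees "b6d"; an agent standing opposite the camera sees "p9q".
assert rotate_180(["b6d"]) == ["p9q"]
```

An egocentrically biased model answers with the camera view ("b6d") rather than the rotated view, which is exactly the error mode the study counts. The headline error statistic can likewise be read as a simple scoring rule; the following sketch assumes exact-match scoring, since the paper's precise metric is not given in this summary:

```python
def egocentric_error_share(predictions, camera_views, targets):
    """Fraction of a model's errors that reproduce the camera viewpoint."""
    errors = [(pred, cam) for pred, cam, tgt
              in zip(predictions, camera_views, targets) if pred != tgt]
    if not errors:
        return 0.0  # no errors, hence no egocentric errors
    return sum(pred == cam for pred, cam in errors) / len(errors)
```

Under this reading, a value near 0.75 would match the reported finding that roughly three-quarters of errors reproduce the camera viewpoint.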
