📄 Abstract (translated from Chinese)
Multimodal language models (MLMs) perform well on semantic vision-language tasks but struggle with spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a pervasive egocentric bias and raise the question of whether current models support allocentric reasoning. Inspired by human spatial cognition, perspective tokens are introduced: specialized embeddings that encode orientation in two ways. First, through embodied body pose, which places the model in a virtual agent's viewpoint so it can understand and interpret visual information from that agent's perspective; second, through explicit geometric transformations, which encode the relative positions and orientations among objects and agents in the scene into the tokens, enabling the model to compute and reason about spatial relations across viewpoints. These perspective tokens are integrated into the multimodal Transformer architecture as additional input signals or as guidance for the attention mechanism, helping the model dynamically adjust its spatial frame of reference while processing visual and linguistic information. In this way, the model is no longer confined to a fixed egocentric viewpoint: it can simulate and switch to other agents' perspectives for more accurate spatial reasoning. For example, in tasks such as "viewing from behind me" or "viewing from the robot's left," the model can parse the perspective tokens to link observed images with the corresponding spatial descriptions, and then generate new descriptions, or execute actions, consistent with that viewpoint. Experimental results show that this approach significantly improves performance on tasks requiring allocentric spatial reasoning, such as navigation instruction understanding, multi-agent collaboration, and scene description generation, effectively mitigating the egocentric bias of existing MLMs.
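The viewpoint switch behind examples like "from the robot's left" can be illustrated with a toy 2D frame change. This is a hypothetical sketch (function name and parameterization are my own, not the paper's learned mechanism): the same object is "left" in one viewer's frame and "right" in another's.

```python
import numpy as np

def relation_in_frame(obj, viewer_pos, viewer_yaw):
    """Return 'left' or 'right' for `obj` as seen by a viewer at
    `viewer_pos` facing `viewer_yaw` (radians, 0 = +x direction).
    Toy 2D illustration of an egocentric frame transform."""
    # Rotate the world-frame offset by -yaw to enter the viewer's frame,
    # where +x is forward and +y is the viewer's left.
    c, s = np.cos(-viewer_yaw), np.sin(-viewer_yaw)
    rel = np.array([[c, -s], [s, c]]) @ (np.asarray(obj, float) - np.asarray(viewer_pos, float))
    return "left" if rel[1] > 0 else "right"

# An object at (1, 1): "left" for me at the origin facing +x,
# but "right" for an agent at (2, 0) facing me (yaw = pi).
print(relation_in_frame([1, 1], [0, 0], 0.0))     # left
print(relation_in_frame([1, 1], [2, 0], np.pi))   # right
```

The sign flip between the two calls is exactly the allocentric judgment that a purely egocentric model gets wrong.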
📄 English Summary
Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
Multimodal language models (MLMs) excel at semantic vision-language tasks but struggle with spatial reasoning that requires adopting another agent's visual perspective. These shortcomings reveal a persistent egocentric bias in current models and prompt an examination of their capacity for allocentric reasoning. Drawing inspiration from human spatial cognition, perspective tokens are introduced: specialized embeddings that encode orientation through two primary mechanisms. First, embodied body keypoints position the model within the viewpoint of a virtual agent, enabling it to interpret visual information from that agent's perspective. Second, explicit geometric transformations encode the relative positions and orientations between objects and agents in a scene directly into the tokens, allowing the model to compute and reason about spatial relationships from diverse viewpoints. These perspective tokens are integrated into the multimodal Transformer architecture, serving either as additional input signals or as guidance within the attention mechanism, which helps the model dynamically adjust its spatial frame of reference when processing visual and linguistic information. Consequently, the model is not limited to a fixed, egocentric understanding; it can simulate and switch to other agents' perspectives for more accurate spatial reasoning. For instance, in tasks like "viewing from behind me" or "viewing from the robot's left," the model can parse the perspective tokens to correlate observed images with the corresponding spatial descriptions, generating new descriptions aligned with that perspective or executing appropriate actions.
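A minimal sketch of how such a token might be formed and injected as an additional input signal. All names are illustrative assumptions: the fixed random projection stands in for weights that the paper would learn end to end, and the patch embeddings are zeros purely for shape bookkeeping.

```python
import numpy as np

def perspective_token(agent_pos, agent_yaw, d_model=8):
    """Encode an agent's pose (2D position plus heading) as a single
    embedding vector. The random projection is a stand-in for a
    learned linear layer."""
    # sin/cos encode heading continuously, avoiding the 0/2*pi wraparound.
    pose = np.array([*agent_pos, np.sin(agent_yaw), np.cos(agent_yaw)])
    W = np.random.default_rng(0).standard_normal((d_model, pose.size))
    return W @ pose

# Prepend the token to toy visual patch embeddings; attention layers can
# then condition every patch on the target viewpoint.
visual_tokens = np.zeros((16, 8))   # 16 dummy patches, d_model = 8
sequence = np.vstack([perspective_token((2.0, 0.0), np.pi), visual_tokens])
```

Prepending (rather than adding the pose to every patch) keeps the viewpoint as a distinct, attendable token, mirroring how class or register tokens are handled in standard vision Transformers.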
Experimental results demonstrate that this approach significantly enhances the model's performance on tasks requiring allocentric spatial reasoning, such as navigation instruction comprehension, multi-agent collaboration, and scene description generation, thereby substantially mitigating the inherent egocentric bias in existing MLMs.