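📄 Summary (translated from Chinese)
Positional embeddings (PEs) in Vision Transformers (ViTs) are not simple token indices; they act as geometric priors that shape the spatial structure of representations. Token-level diagnostics are introduced to quantify how multi-view geometric consistency in ViT representations depends on consistent PEs. Extensive experiments on 14 foundation models reveal the complex role PEs play in strengthening or impairing spatial reasoning. Specifically, when PEs accurately encode the relative spatial relationships between objects, they markedly improve the model's ability to understand and process geometric information, so the model performs better on tasks involving changes in object position, orientation, and scale. When PEs do not match the actual geometric structure or introduce incorrect geometric biases, however, they interfere with spatial reasoning, cause the model to misjudge spatial relationships, and degrade performance.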
📄 English Summary
Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning
Positional embeddings (PEs) in Vision Transformers (ViTs) are re-examined not as mere token indices but as geometric priors that shape the spatial structure of representations. Token-level diagnostics are introduced to quantify how multi-view geometric consistency in ViT representations depends on consistent PEs. Extensive experiments across 14 foundation models show that PEs can either enhance or impair spatial reasoning. When PEs accurately encode the relative spatial relationships of objects, they improve the model's handling of geometric information, leading to better performance on tasks involving variations in object position, orientation, and scale. When PEs are misaligned with the true geometric structure or introduce erroneous geometric biases, they interfere with spatial reasoning, cause the model to misread spatial relationships, and degrade performance.

The findings indicate that the geometric information a PE encodes must match the spatial reasoning a given task requires. In tasks that demand fine-grained geometric perception, carefully designed PEs can serve as strong inductive biases that guide the model toward more robust geometric features. In other scenarios, overly rigid PEs may limit the model's ability to generalize. Understanding how PEs influence the geometric representations of ViTs is therefore important for developing more efficient and versatile visual models. The study also investigates how different types of PEs (e.g., absolute PEs, relative PEs) affect the model's geometric consistency, and it proposes directions for optimizing PE design to improve spatial reasoning.
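
To make the idea of PE-dependent token consistency concrete, here is a toy sketch. It is not the study's diagnostic. It assumes a single untrained self-attention layer with random weights, 1D sinusoidal PEs, and a second "view" in which the same patches land in shifted grid slots. All function names and the cosine-based score are hypothetical choices for illustration. Because self-attention without positional information is permutation-equivariant, corresponding tokens match across views when each patch keeps the same PE, and they drift apart when PEs are reassigned by grid slot.

```python
import numpy as np

def sinusoidal_pe(num_pos, dim):
    """Standard 1D sinusoidal positional embeddings (assumed PE type for this toy)."""
    pos = np.arange(num_pos)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = pos * freqs[None, :]
    pe = np.zeros((num_pos, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with a numerically stable softmax."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def token_consistency(out_a, out_b):
    """Hypothetical score: mean cosine similarity between corresponding tokens."""
    num = (out_a * out_b).sum(axis=-1)
    den = np.linalg.norm(out_a, axis=-1) * np.linalg.norm(out_b, axis=-1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
n, d = 16, 32
content = rng.normal(size=(n, d))            # stand-in patch features
pe = sinusoidal_pe(n, d)
wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# View B shows the same patches, but slot j now holds patch perm[j].
perm = np.roll(np.arange(n), 3)

# View A: patch i sits in slot i and receives pe[i].
out_a = self_attention(content + pe, wq, wk, wv)

# View B, consistent PEs: each patch keeps the PE it had in view A.
out_b_consistent = self_attention(content[perm] + pe[perm], wq, wk, wv)

# View B, grid PEs: each patch receives the PE of the slot it now occupies.
out_b_grid = self_attention(content[perm] + pe, wq, wk, wv)

# Compare each patch's token in view A with the same patch's token in view B.
consistent_score = token_consistency(out_a[perm], out_b_consistent)
grid_score = token_consistency(out_a[perm], out_b_grid)

print(f"consistent PEs: {consistent_score:.4f}")
print(f"grid PEs:       {grid_score:.4f}")
```

In this toy setup the consistent-PE score equals 1 up to floating-point error, while the grid-PE score is lower. The size of the gap depends on the random weights and on the relative scale of content and PE, so it says nothing about the magnitudes reported for real models.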