Frayed RoPE and Long Inputs: A Geometric Perspective

📄 Summary

Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding positions in language models. While effective, it suffers a performance breakdown when input lengths exceed the training length. Prior analyses have observed that long inputs cause channels to rotate "out of distribution," but how this extra rotation relates to, or causes, the pathological behavior has remained unclear. Through empirical and theoretical analysis, the work develops a unified geometric understanding of attention behavior under RoPE. It finds that attention induces tight clustering of separated key and query latent point clouds, which gives rise to sink tokens: placeholders that let attention heads avoid token mixing when mixing is not required. Applying RoPE to inputs longer than those seen in training disrupts this geometry in characteristic ways that degrade model performance.
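To make the "channels rotate with position" picture concrete, the following is a minimal NumPy sketch of RoPE (not the paper's code): each consecutive channel pair is rotated by an angle proportional to the token's position, with per-pair frequencies `base**(-2i/d)`. Positions beyond the training length simply produce larger rotation angles on the low-frequency pairs, which is the "extra rotation" the summary refers to.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, d).

    Each channel pair (2i, 2i+1) is rotated by angle
    position * base**(-2i/d). Illustrative sketch only.
    """
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd channels of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2x2 rotation applied pairwise
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Two properties follow directly from the rotation form: norms are preserved, and the query-key dot product depends only on the relative position between the two tokens, which is why RoPE encodes relative offsets despite rotating each token absolutely.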

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.