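📄 Summary (translated from the Chinese)
Rotary Position Embedding (RoPE) extension aims to handle sequences longer than those seen during pre-training. Existing extension strategies, however, are highly diverse in method and lack a unified theoretical foundation. MrRoPE (Mixed-radix RoPE) proposes a generalized encoding formulation based on radix system conversion that unifies and generalizes the main RoPE extension methods in use today. Treating a sequence position as a number written in a particular radix system, MrRoPE extends RoPE's frequency allocation strategy to non-decimal radices. Specifically, it allows a different radix in each dimension, which yields more flexible and adaptive frequency decay patterns. Because different radices can correspond to different periodicities or decay rates, this mixed-radix approach is better able to capture both local and global dependencies in long sequences.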
📄 English Summary
MrRoPE: Mixed-radix Rotary Position Embedding
Rotary Position Embedding (RoPE) extension aims to let a model handle sequences longer than those seen during pre-training. Existing extension strategies, however, are highly diverse and lack a unified theoretical foundation. MrRoPE (Mixed-radix RoPE) proposes a generalized encoding formulation based on radix system conversion that unifies and generalizes the prevailing RoPE extension methods. Treating a sequence position as a number written in a particular radix system, MrRoPE extends RoPE's frequency allocation strategy to non-decimal radices and allows a different radix in each dimension, which yields more flexible and adaptive frequency decay patterns. Because different radices can correspond to different periodicities or decay rates, this mixed-radix approach is better able to capture both local and global dependencies in long sequences.

Popular methods such as linear interpolation and NTK-RoPE arise as special cases of MrRoPE, and the framework gives a theoretical account of their underlying mechanisms. Adjusting the radices and their associated scaling factors yields new positional encoding schemes.

According to the reported experiments, MrRoPE generalizes better and performs better than existing methods on long-sequence tasks, particularly in mitigating catastrophic forgetting when a model must process inputs significantly longer than its pre-training length at inference time. Its core idea is to use radix system conversion to give frequency allocation in positional encoding a deeper mathematical interpretation and a broader design space, opening new directions for long-sequence modeling.
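To make the radix view concrete, here is a minimal sketch of how linear interpolation and NTK-RoPE can both be written as choices of per-dimension radix factors. It is an illustration under stated assumptions, not the paper's formulation: the cumulative-product parameterization, the function names, and the `late_dims_only_scales` schedule are all hypothetical. The sketch relies on the standard RoPE frequencies `theta_i = base^(-2i/dim)`, where the ratio between neighbouring frequencies acts as the radix of that "digit". Linear interpolation then amounts to rescaling the unit (the first factor equals the extension ratio `s`), while NTK-RoPE amounts to enlarging every radix uniformly so that only the lowest frequency is divided by the full `s`.

```python
import math
import operator
from itertools import accumulate


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base^(-2i/dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def mixed_radix_frequencies(dim, base, radix_scales):
    """Hypothetical parameterization (an assumption of this sketch):
    frequency i is divided by the cumulative product of the per-dimension
    radix factors radix_scales[0..i]."""
    theta = rope_frequencies(dim, base)
    if len(radix_scales) != len(theta):
        raise ValueError("need one radix factor per frequency (dim // 2)")
    cumulative = accumulate(radix_scales, operator.mul)
    return [t / c for t, c in zip(theta, cumulative)]


def linear_interpolation_scales(dim, s):
    """Linear interpolation: rescale the unit by s, keep every radix."""
    return [float(s)] + [1.0] * (dim // 2 - 1)


def ntk_scales(dim, s):
    """NTK-RoPE: keep the unit, enlarge every radix by s^(2/(dim-2))."""
    return [1.0] + [s ** (2.0 / (dim - 2))] * (dim // 2 - 1)


def late_dims_only_scales(dim, s, start):
    """Illustrative mixed schedule (not from the paper): leave the first
    `start` frequencies untouched and spread the factor s evenly over the
    remaining ones, so the lowest frequency is still divided by s."""
    n = dim // 2
    if not 0 <= start < n:
        raise ValueError("start must satisfy 0 <= start < dim // 2")
    return [1.0] * start + [s ** (1.0 / (n - start))] * (n - start)


if __name__ == "__main__":
    dim, base, s = 64, 10000.0, 4.0
    for name, scales in [
        ("linear", linear_interpolation_scales(dim, s)),
        ("ntk", ntk_scales(dim, s)),
        ("late-dims", late_dims_only_scales(dim, s, start=16)),
    ]:
        freqs = mixed_radix_frequencies(dim, base, scales)
        print(name, freqs[0], freqs[-1])
```

In all three schedules the factors multiply to `s`, so the lowest frequency covers the extended context; they differ only in how that factor is distributed across dimensions, which is the design space the summary refers to.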