Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy

📄 Summary

Gradient signals in large language model (LLM) training are highly anisotropic: recurrent linguistic structures concentrate energy in a small set of dominant spectral directions, while context-specific information resides in a long tail. This spike-tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating the optimizer's statistics. That dominance suppresses tail learning: second-moment normalization contracts tail updates, and the spike tightens the bound on the globally stable learning rate. Motivated by this analysis, the paper proposes Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace effectively, enabling better learning outcomes.
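
To make the contraction mechanism concrete, here is a small self-contained simulation (an illustration of the effect described above, not an experiment from the paper; the dimensions, scales, and hyperparameters are all invented for the demo). It draws anisotropic gradients whose energy concentrates in a k-dimensional spike with k/d ≈ 1.5%, accumulates an Adam-style diagonal second moment, and compares the tail-direction step with and without the spike present:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, steps = 512, 8, 500            # k/d ~= 1.5%, matching the spike fraction above
spike_scale, tail_scale = 10.0, 0.1  # the spike carries most of the gradient energy

# Random orthonormal basis; the first k columns span the "spike" subspace.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = Q[:, :k]

def grad(with_spike=True):
    """Anisotropic gradient: large energy on the spike, small on the tail."""
    c = rng.standard_normal(d) * tail_scale
    if with_spike:
        c[:k] += rng.standard_normal(k) * spike_scale
    return Q @ c

beta2, eps, lr = 0.999, 1e-8, 1e-3
v_full = np.zeros(d)  # second moment the optimizer actually sees (spike present)
v_tail = np.zeros(d)  # counterfactual second moment with the spike removed

for _ in range(steps):
    v_full = beta2 * v_full + (1 - beta2) * grad(True) ** 2
    v_tail = beta2 * v_tail + (1 - beta2) * grad(False) ** 2

bc = 1 - beta2 ** steps            # Adam bias correction
g = grad(True)
g_tail = g - U @ (U.T @ g)         # tail component of a fresh gradient

step_with_spike = lr * g_tail / (np.sqrt(v_full / bc) + eps)
step_no_spike = lr * g_tail / (np.sqrt(v_tail / bc) + eps)

shrink = np.linalg.norm(step_no_spike) / np.linalg.norm(step_with_spike)
print(f"tail step is ~{shrink:.1f}x smaller when the spike dominates v")
```

Because the spike subspace is dense in the coordinate basis, its energy inflates every diagonal entry of the second-moment estimate, so the same tail gradient is divided by a much larger normalizer; that division is the contraction of tail updates described above.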

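The summary does not spell out Spectra's update rule, so the following is only a sketch of the idea it describes, under stated assumptions: the dominant subspace is tracked by one subspace-iteration step per update on a running gradient covariance, and the update's spike component is shrunk by a fixed damping factor while the tail passes through at full strength, so the tail is never amplified. The class name `SpikeDampedSGD`, the tracking scheme, and every hyperparameter are hypothetical choices for the demo, not the paper's algorithm.

```python
import numpy as np

class SpikeDampedSGD:
    """Toy spike-aware optimizer (an illustrative sketch, not the paper's Spectra).

    Tracks the top-k gradient subspace by subspace iteration on a running
    covariance estimate, then damps the update component inside that subspace.
    A practical implementation would avoid materializing the O(d^2) covariance.
    """

    def __init__(self, dim, k=8, lr=1e-2, beta=0.99, damp=0.1, seed=0):
        self.lr, self.beta, self.damp = lr, beta, damp
        self.C = np.zeros((dim, dim))  # running gradient covariance (toy scale only)
        rng = np.random.default_rng(seed)
        self.U, _ = np.linalg.qr(rng.standard_normal((dim, k)))  # spike basis guess

    def step(self, params, grad):
        # Refresh the covariance estimate, then refine the spike basis with one
        # subspace-iteration step (multiply by C, re-orthonormalize via QR).
        self.C = self.beta * self.C + (1 - self.beta) * np.outer(grad, grad)
        self.U, _ = np.linalg.qr(self.C @ self.U)

        # Split the gradient into its spike and tail components.
        g_spike = self.U @ (self.U.T @ grad)
        g_tail = grad - g_spike

        # Shrink the spike; pass the tail through unchanged (never amplified).
        return params - self.lr * (g_tail + self.damp * g_spike)


# Minimal usage on a synthetic quadratic whose curvature is spiked in 2 directions.
rng = np.random.default_rng(1)
d = 64
A = np.diag(np.r_[np.full(2, 50.0), np.full(d - 2, 0.5)])  # sharp spike, flat tail
w = rng.standard_normal(d)
opt = SpikeDampedSGD(dim=d, k=2)
for _ in range(300):
    w = opt.step(w, A @ w)  # gradient of the loss 0.5 * w^T A w
print(f"final loss: {0.5 * w @ A @ w:.4f}")
```

Damping only the spike is one way to satisfy both constraints in the summary at once: the spike no longer dominates the update, and, because the spike also sets the stability limit, suppressing it should loosen the bound on the globally stable learning rate without the noise amplification that rescaling the tail would risk.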
