Latent Semantic Manifolds in Large Language Models

📄 Summary

Large Language Models (LLMs) perform their internal computation in continuous vector spaces but must ultimately emit discrete tokens, and the geometric consequences of this fundamental mismatch are not well understood. The paper develops a mathematical framework that interprets an LLM's hidden states as points on a latent semantic manifold: a Riemannian submanifold equipped with the Fisher information metric, on which tokens correspond to Voronoi regions partitioning the manifold. It defines the expressibility gap, a geometric measure of the semantic distortion induced by vocabulary discretization, and proves two theorems: a rate-distortion lower bound on the distortion achievable by any finite vocabulary, and a linear volume scaling law for the expressibility gap derived via the coarea formula. Empirical results are reported that support the theoretical framework.
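The token-as-Voronoi-cell picture can be illustrated with a toy quantization sketch. This is a minimal illustration, not the paper's method: it assumes a plain Euclidean metric rather than the Fisher information metric, and all names and data below are hypothetical. Each continuous hidden state is snapped to its nearest token embedding, and the average snapping error plays the role of the expressibility gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a vocabulary of V token embeddings in d dimensions.
V, d = 50, 8
token_embeddings = rng.normal(size=(V, d))

# Continuous hidden states "produced by a model" (here: random points).
hidden_states = rng.normal(size=(1000, d))

def quantize(h, E):
    """Map each hidden state to the index of its nearest token embedding.

    Under the Euclidean metric, the set of states mapped to token i is
    exactly the Voronoi cell of E[i] -- the discrete token stands in for
    that whole region of the continuous space.
    """
    # Squared distances between every state and every embedding.
    d2 = ((h[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

ids = quantize(hidden_states, token_embeddings)

# Mean distortion introduced by discretization: a crude Euclidean
# stand-in for the paper's expressibility gap (which is defined with
# respect to the Fisher geometry of the manifold).
gap = np.linalg.norm(hidden_states - token_embeddings[ids], axis=1).mean()
print(f"mean quantization distortion: {gap:.3f}")
```

Enlarging the vocabulary shrinks the Voronoi cells and hence this distortion, which is the intuition behind a vocabulary-size lower bound on achievable distortion.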


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others