📄 English Summary
Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability
Evaluating the reliability of large language models (LLMs) through scalar probabilities often fails to capture the structural dynamics of reasoning. The TRACED framework introduces a theoretically grounded approach to assess reasoning quality via geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), a distinct topological divergence is revealed: correct reasoning manifests as high-progress, stable trajectories, while hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, the proposed probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. TRACED effectively bridges the gap between geometry and cognitive science.
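The summary does not give TRACED's exact kinematic formulas, but the core idea (a reasoning trace as a trajectory of embedding vectors, scored by net displacement versus path wandering) can be sketched minimally. Everything below is an illustrative assumption: the function names, the progress ratio (net displacement over path length), and the curvature proxy (mean turning angle between consecutive steps) are stand-ins, not the paper's actual definitions.

```python
import math

def norm(v):
    # Euclidean length of a vector
    return math.sqrt(sum(x * x for x in v))

def step_vectors(traj):
    # displacement vector between each pair of consecutive trace points
    return [[b - a for a, b in zip(p, q)] for p, q in zip(traj, traj[1:])]

def progress(traj):
    # Illustrative "Progress": net displacement / total path length.
    # 1.0 means a perfectly straight trajectory; near 0 means stalled wandering.
    steps = step_vectors(traj)
    path = sum(norm(s) for s in steps)
    net = norm([b - a for a, b in zip(traj[0], traj[-1])])
    return net / path if path else 0.0

def curvature(traj):
    # Illustrative "Stability" proxy: mean turning angle (radians)
    # between consecutive step vectors; high values = unstable trajectory.
    steps = step_vectors(traj)
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = norm(u), norm(v)
        if nu and nv:
            cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
            angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp for float error
    return sum(angles) / len(angles) if angles else 0.0
```

On these toy definitions, a straight trace like `[[0,0],[1,0],[2,0],[3,0]]` scores maximal progress and zero curvature, while a zig-zag that returns to its start scores near-zero progress with large turning angles, mirroring the high-progress/stable versus low-progress/unstable signatures described above.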
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others