📄 English Summary
Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability
Evaluating the reliability of large language models (LLMs) through scalar probabilities often fails to capture the structural dynamics of reasoning. The TRACED framework introduces a theoretically grounded approach to assess reasoning quality via geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), a distinct topological divergence is revealed: correct reasoning manifests as high-progress, stable trajectories, while hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, the proposed probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. TRACED effectively bridges the gap between geometry and cognitive science.
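The summary does not give TRACED's exact kinematic formulas, but the core idea (a reasoning trace as a trajectory of embedding vectors, scored by net displacement versus path wandering) can be sketched minimally. Everything below is an illustrative assumption: the function names, the progress ratio (net displacement over path length), and the curvature proxy (mean turning angle between consecutive steps) are stand-ins, not the paper's actual definitions.

```python
import math

def norm(v):
    # Euclidean length of a vector
    return math.sqrt(sum(x * x for x in v))

def step_vectors(traj):
    # displacement vector between each pair of consecutive trace points
    return [[b - a for a, b in zip(p, q)] for p, q in zip(traj, traj[1:])]

def progress(traj):
    # Illustrative "Progress": net displacement / total path length.
    # 1.0 means a perfectly straight trajectory; near 0 means stalled wandering.
    steps = step_vectors(traj)
    path = sum(norm(s) for s in steps)
    net = norm([b - a for a, b in zip(traj[0], traj[-1])])
    return net / path if path else 0.0

def curvature(traj):
    # Illustrative "Stability" proxy: mean turning angle (radians)
    # between consecutive step vectors; high values = unstable trajectory.
    steps = step_vectors(traj)
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = norm(u), norm(v)
        if nu and nv:
            cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
            angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp for float error
    return sum(angles) / len(angles) if angles else 0.0
```

On these toy definitions, a straight trace like `[[0,0],[1,0],[2,0],[3,0]]` scores maximal progress and zero curvature, while a zig-zag that returns to its start scores near-zero progress with large turning angles, mirroring the high-progress/stable versus low-progress/unstable signatures described above.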
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others