Me, Myself, and π: Evaluating and Explaining LLM Introspection

📄 English Summary

Me, Myself, and $\pi$: Evaluating and Explaining LLM Introspection

Introspection, the ability to assess and reason about one's own cognitive processes, is a hallmark of human intelligence and has emerged as a promising yet contested capability in large language models (LLMs). Current evaluations often fail to distinguish genuine meta-cognition from mere applications of general world knowledge or text-based self-simulation. This research proposes a principled taxonomy that formalizes introspection as the latent computation of specific operators over a model's policy and parameters. To isolate the components of generalized introspection, Introspect-Bench is introduced as a multifaceted evaluation suite designed for rigorous capability testing. Results indicate that frontier models exhibit significant variations in introspective abilities.
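The summary describes introspection being formalized as the latent computation of an operator over a model's policy and parameters. As a minimal sketch of what such an operator-based test could look like (this is an illustration, not the paper's actual definitions or the Introspect-Bench API): take the operator $F$ to be the entropy of the model's own next-token distribution $\pi$, and score how closely a model's self-report matches the true value. All names and numbers below are hypothetical.

```python
import math

def entropy(pi):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in pi.values() if p > 0)

# Toy policy: a model's distribution over four candidate tokens.
pi = {"yes": 0.5, "no": 0.25, "maybe": 0.125, "unsure": 0.125}

# Ground truth: the operator F applied to the policy itself.
true_value = entropy(pi)

# Hypothetical self-report: what the model *claims* its entropy is.
reported_value = 1.21

# A simple introspection score: absolute error of the self-report.
error = abs(reported_value - true_value)
print(f"F(pi) = {true_value:.3f} nats, report = {reported_value}, error = {error:.3f}")
```

The design point this illustrates: because $F(\pi)$ is computed directly from the policy, a self-report can be checked against ground truth, separating genuine meta-cognition from plausible-sounding guesses drawn from general world knowledge.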


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others.