推理模型难以控制其思维链

📄 中文摘要

链式思维(CoT)监控是一种有前景的工具,用于检测现代推理模型的错误行为及其动机。然而,如果模型能够控制其在链式思维中表达的内容,这可能会削弱链式思维的可监控性。为衡量这一不良能力——链式思维可控性,研究提出了链式思维控制评估套件,其中包含要求模型在遵循链式思维指令的情况下解决问题的任务,例如在不使用“染色体”一词的情况下推理关于遗传学的问题。研究表明,推理模型的链式思维可控性显著低于输出可控性;例如,Claude Sonnet 4.5在控制其链式思维时仅能做到2.7%,而在控制最终输出时则能达到61.9%。

📄 English Summary

Reasoning Models Struggle to Control their Chains of Thought

Chain-of-thought (CoT) monitoring serves as a promising tool for detecting misbehaviors and understanding the motivations behind modern reasoning models. However, if these models can control what they verbalize in their CoT, it may undermine the monitorability of CoT. To assess this undesirable capability—CoT controllability—a new evaluation suite called CoT-Control has been introduced. This suite includes tasks that require models to solve problems while adhering to specific CoT instructions, such as reasoning about a genetics question without using the term 'chromosome'. The findings reveal that reasoning models exhibit significantly lower CoT controllability compared to output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time, while it achieves 61.9% control over its final output.

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等