The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts
📄 Summary
The study shows that code generation tolerates aggressive prompt compression remarkably well, while chain-of-thought reasoning degrades steadily as compression increases. Validation across six code benchmarks (including HumanEval, MBPP, HumanEval+, and MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM) confirms that the compression threshold generalizes across programming languages and difficulty levels. A first per-token perplexity analysis then uncovers a "perplexity paradox": code syntax tokens are preserved despite high perplexity, whereas reasoning tokens are disproportionately harmed. These findings offer a new lens on why code and math respond so differently to prompt compression, and lay the groundwork for future adaptive compression algorithms.
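The per-token analysis rests on a simple metric: the perplexity of each token given its prefix, exp(−log p(tokenᵢ | prefix)). Below is a minimal sketch of that computation, assuming per-token log-probabilities have already been obtained from some language model; the function names and the example numbers are illustrative, not taken from the paper.

```python
import math

def per_token_perplexity(token_log_probs):
    """Perplexity of each token given its prefix: exp(-log p(t_i | t_<i))."""
    return [math.exp(-lp) for lp in token_log_probs]

def sequence_perplexity(token_log_probs):
    """Overall perplexity: exponential of the mean negative log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Illustrative log-probabilities for two hypothetical tokens: one the model
# assigns probability 0.9 (perplexity ~1.11), one probability 0.05
# (perplexity ~20). A perplexity-based compressor would rank these tokens
# very differently when deciding what to drop.
print(per_token_perplexity([math.log(0.9), math.log(0.05)]))
print(sequence_perplexity([math.log(0.9), math.log(0.05)]))
```

Perplexity-guided prompt compressors typically use exactly this kind of per-token score to decide which tokens to keep, which is why the token-level distribution matters for the paradox described above.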
Powered by Cloudflare Workers + Payload CMS + Claude 3.5
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others