Enhancing the Mathematical Problem-Solving Ability of Large Language Models through Execution-Driven Reasoning Augmentation

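📄 Chinese Summary (translated)

Large language models still face challenges in mathematical problem solving, especially in complex reasoning and numerical precision. This paper proposes a new paradigm called Execution-Driven Reasoning Augmentation (EDRA), which aims to address these problems by tightly coupling the model's generated reasoning steps with an external executor such as a Python interpreter. The core idea of EDRA is that each time the model generates a reasoning step, the executor immediately verifies its correctness and the execution result is fed back to the model, forming a closed-loop self-correction process. This approach can identify and correct errors in the reasoning chain and also improves the accuracy of numerical computation. The experimental results show that EDRA achieves significant performance gains on several mathematical benchmarks (such as GSM8K, MATH, and MMLU-Math), surpassing existing advanced methods. EDRA's advantage lies in its generality: it can be combined with various LLM architectures and inference strategies, offering an effective way to improve LLM performance on tasks that require precise calculation and logical reasoning. The study underscores the importance of external tools in enhancing LLM capabilities, and for future LLM mathematical…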

📄 English Summary

Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Large Language Models (LLMs) still struggle with mathematical problem solving, particularly with complex reasoning and numerical precision. This paper introduces Execution-Driven Reasoning Augmentation (EDRA), a paradigm that tightly couples model-generated reasoning steps with an external executor such as a Python interpreter. As the LLM generates each reasoning step, the executor immediately verifies it, and the execution result is fed back to the model, forming a closed-loop self-correction process. This lets the system identify and correct errors within the reasoning chain and improves the accuracy of numerical computation. The reported experiments show that EDRA improves performance on several mathematical benchmarks, including GSM8K, MATH, and MMLU-Math, and outperforms existing state-of-the-art methods. EDRA is also general: it can be combined with different LLM architectures and inference strategies, which makes it an effective way to improve LLM performance on tasks that demand precise calculation and logical reasoning. The work underscores the importance of external tools in augmenting LLM capabilities and opens new avenues for research on mathematical reasoning in LLMs.
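The closed loop described above (generate a step, execute it, feed the result back) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the summary gives no interface details, so `run_step`, `solve`, the `answer` variable convention, the retry budget, and the `scripted_model` stand-in for an LLM are all hypothetical names chosen for illustration. The executor here is a plain `exec` call with no sandboxing, which would not be safe for untrusted model output.

```python
from typing import Callable, List, Tuple


def run_step(code: str, env: dict) -> Tuple[bool, str]:
    """Execute one reasoning step; return (ok, feedback for the model).

    The step runs against a shallow copy of the shared namespace, so a step
    that raises does not leave half-assigned names behind.
    NOTE: plain exec() is NOT a sandbox; this is for illustration only.
    """
    trial = dict(env)
    try:
        exec(code, trial)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    env.update(trial)  # commit the step only after it ran cleanly
    visible = {k: v for k, v in env.items() if not k.startswith("__")}
    return True, repr(visible)


def solve(
    question: str,
    propose_step: Callable[[str, List[str]], str],  # hypothetical LLM wrapper
    max_steps: int = 8,
    max_retries: int = 2,
) -> Tuple[object, List[str]]:
    """Generate, execute, and feed back steps until `answer` is defined."""
    env: dict = {}
    history: List[str] = []  # execution feedback the model sees on each call
    for _ in range(max_steps):
        for _attempt in range(max_retries + 1):
            code = propose_step(question, history)
            ok, feedback = run_step(code, env)
            status = "OK" if ok else "ERROR"
            history.append(f"{status} | {code} | {feedback}")
            if ok:
                break
        else:
            raise RuntimeError("step could not be repaired within retry budget")
        if "answer" in env:  # assumed convention: the final step sets `answer`
            return env["answer"], history
    raise RuntimeError("no answer within step budget")


def scripted_model(question: str, history: List[str]) -> str:
    """Stand-in for an LLM: makes one mistake, then repairs it from feedback."""
    if not history:
        return "total = price * 3"  # bug: `price` is not defined yet
    if history[-1].startswith("ERROR"):
        return "price = 4; total = price * 3"  # repaired after seeing the error
    return "answer = total + 0.5"


if __name__ == "__main__":
    answer, history = solve("3 items at 4 each plus 0.5 shipping?", scripted_model)
    print(answer)
    for line in history:
        print(line)
```

In this sketch the first proposed step fails, the error message is appended to the history, and the next proposal uses it to repair the step. A real system would replace `scripted_model` with a call to an actual LLM and run the executor in an isolated process.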
