How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

📄 Summary

Deducing the culprit poses challenges for large language model (LLM) agents. A text-based multi-agent version of the classic board game Clue is implemented as a rule-based testbed for evaluating multi-step deductive reasoning, involving six agents sourced from GPT-4o-mini and Gemini-2.5-Flash. The study further investigates whether fine-tuning on structured logic puzzles translates to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieved only four correct wins, indicating difficulties in maintaining consistent deductive reasoning throughout a full game. Additionally, it was found that fine-tuning does not reliably enhance performance and, in some instances, appears to increase the volume of reasoning without improving its precision.
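The multi-step deduction the testbed evaluates can be illustrated with a minimal sketch. This is a hypothetical illustration of the core Clue mechanic, not the paper's agent implementation: an agent starts with every suspect, weapon, and room as a candidate, eliminates any card another player shows it, and accuses only once exactly one candidate remains in each category.

```python
# Minimal sketch of Clue-style deduction (illustrative only; not the
# paper's implementation). Card names are invented for the example.

SUSPECTS = {"Scarlett", "Mustard", "Plum"}
WEAPONS = {"Rope", "Knife", "Wrench"}
ROOMS = {"Kitchen", "Library", "Study"}

class DeducingAgent:
    def __init__(self) -> None:
        # Initially, every card could be part of the hidden solution.
        self.candidates = {
            "suspect": set(SUSPECTS),
            "weapon": set(WEAPONS),
            "room": set(ROOMS),
        }

    def observe(self, category: str, card: str) -> None:
        # A card shown by another player cannot be in the envelope.
        self.candidates[category].discard(card)

    def accusation(self):
        # Accuse only when each category is narrowed to a single card;
        # otherwise keep gathering evidence.
        if all(len(s) == 1 for s in self.candidates.values()):
            return {k: next(iter(s)) for k, s in self.candidates.items()}
        return None

agent = DeducingAgent()
for category, card in [("suspect", "Scarlett"), ("suspect", "Plum"),
                       ("weapon", "Rope"), ("weapon", "Knife"),
                       ("room", "Kitchen"), ("room", "Study")]:
    agent.observe(category, card)

print(agent.accusation())
# {'suspect': 'Mustard', 'weapon': 'Wrench', 'room': 'Library'}
```

Winning a full game requires an agent to apply this elimination consistently over many turns, which is precisely where the study reports LLM agents breaking down.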
