PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

📄 Summary

Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model. However, relying on external guidance at runtime is often impractical due to latency and availability constraints.

PILOT (Planning via Internalized Latent Optimization Trajectories) addresses this limitation by internalizing the planning capabilities of a teacher model into smaller LLMs. The core idea is that, rather than directly mimicking the teacher model's outputs, the student model learns, through a novel training paradigm, the latent optimization trajectories the teacher employs to generate high-quality plans.

Specifically, PILOT introduces a two-stage training methodology. In the first stage, the student model learns through imitation to generate intermediate states and decision sequences similar to the teacher model's plans, employing contrastive learning and reinforcement learning techniques to guide the student in approximating the teacher's planning paths within the latent space. In the second stage, the student model further refines its planning capabilities through self-supervision and environmental feedback, without direct guidance from the teacher, making the internalized planning trajectories more robust and adaptive.

This approach allows compact LLMs to acquire planning abilities comparable to those of larger teacher models without incurring additional inference costs or external dependencies.
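The two-stage methodology above can be sketched in miniature. Everything in the sketch is an assumption for illustration, not the paper's actual implementation: `stage1_contrastive_loss` models the first stage as an InfoNCE-style contrastive objective pulling each student plan latent toward its paired teacher latent, and `stage2_refine` models the second stage as a REINFORCE-style update driven only by scalar environment feedback, with the teacher no longer consulted. The toy latent vectors, `temperature`, and `lr` values are likewise made up.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def stage1_contrastive_loss(student_latents, teacher_latents, temperature=0.1):
    """Stage 1 (illustrative): InfoNCE-style loss. Each student plan latent is
    pulled toward its paired teacher latent and pushed away from the others."""
    total = 0.0
    for i, s in enumerate(student_latents):
        logits = [cosine(s, t) / temperature for t in teacher_latents]
        # -log softmax probability of the matching teacher latent
        total += math.log(sum(math.exp(z) for z in logits)) - logits[i]
    return total / len(student_latents)

def stage2_refine(plan_logits, env_reward, steps=200, lr=0.5, seed=0):
    """Stage 2 (illustrative): refine plan preferences from environment
    feedback alone via a REINFORCE-style policy-gradient update."""
    rng = random.Random(seed)
    logits = list(plan_logits)
    for _ in range(steps):
        # Softmax over candidate plans (shifted by max for numeric stability).
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Sample a plan and observe a scalar reward from the environment.
        a = rng.choices(range(len(probs)), weights=probs)[0]
        r = env_reward(a)
        # grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
        for j in range(len(logits)):
            logits[j] += lr * r * ((1.0 if j == a else 0.0) - probs[j])
    return logits
```

As a sanity check, aligned student/teacher latents yield a lower stage-1 loss than misaligned ones, and the stage-2 loop shifts probability mass toward whichever candidate plan the environment rewards.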
