Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

📄 Summary

Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

Large Language Models (LLMs) have demonstrated remarkable in-context learning capabilities, yet their online learning capacity remains constrained by myopic decision-making within single episodes. This paper introduces a novel cross-episode Meta-Reinforcement Learning (Meta-RL) framework designed to significantly enhance LLMs' in-context online learning by accumulating and leveraging experience across multiple episodes. Our framework treats the LLM as a policy, optimizing its meta-policy for decision-making and learning within an episode via reinforcement learning. Specifically, we propose two core components: the In-Context Learner and the Cross-Episode Meta-Learner. The In-Context Learner makes decisions within a given episode based on current observations and historical interactions, updating its internal state to adapt to that episode. The Cross-Episode Meta-Learner, in turn, optimizes the In-Context Learner's initialization strategy and learning mechanisms using reward signals collected across different episodes, enabling more effective adaptation to new episodes. Experimental results demonstrate that our method significantly outperforms existing baselines, including traditional reinforcement learning algorithms and prompt engineering-based LLM approaches, across various online learning tasks. This validates that through the Meta-RL paradigm, LLMs can better leverage cross-episode experience, achieving more robust and efficient in-context online learning, thereby laying a foundation for building more intelligent and adaptive AI systems.
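To make the two-component structure concrete, here is a minimal toy sketch of the loop the summary describes: an inner learner that adapts its state within each episode, and an outer meta-learner that tunes the inner learner's hyperparameters from returns collected across episodes. The bandit environment, the hill-climbing meta-update, and all function names are illustrative assumptions for this sketch, not the paper's actual algorithm (which uses an LLM as the policy and reinforcement learning for the meta-update).

```python
import random

def run_episode(alpha, eps, n_arms=5, horizon=200, rng=None):
    """In-Context Learner (toy stand-in): adapts per-arm value estimates
    within a single episode; its state q is reset when the episode ends."""
    rng = rng or random.Random()
    means = [rng.random() for _ in range(n_arms)]  # a fresh bandit per episode
    q = [0.0] * n_arms                             # within-episode ("in-context") state
    total = 0.0
    for _ in range(horizon):
        if rng.random() < eps:                     # explore
            a = rng.randrange(n_arms)
        else:                                      # exploit current estimates
            a = max(range(n_arms), key=q.__getitem__)
        r = 1.0 if rng.random() < means[a] else 0.0
        q[a] += alpha * (r - q[a])                 # within-episode adaptation step
        total += r
    return total

def meta_train(n_meta_steps=60, batch=8, seed=0):
    """Cross-Episode Meta-Learner (toy stand-in): hill-climbs the inner
    learner's hyperparameters using returns averaged over batches of episodes."""
    rng = random.Random(seed)
    alpha, eps = 0.5, 0.5                          # meta-parameters to optimize

    def score(a, e):
        return sum(run_episode(a, e, rng=rng) for _ in range(batch)) / batch

    best = score(alpha, eps)
    for _ in range(n_meta_steps):
        # perturb, evaluate across episodes, keep only improvements
        cand_a = min(1.0, max(0.01, alpha + rng.uniform(-0.1, 0.1)))
        cand_e = min(1.0, max(0.0, eps + rng.uniform(-0.1, 0.1)))
        s = score(cand_a, cand_e)
        if s >= best:
            alpha, eps, best = cand_a, cand_e, s
    return alpha, eps, best
```

The key structural point mirrored here is the separation of timescales: `run_episode` adapts fast within one episode and forgets at its end, while `meta_train` improves slowly across many episodes, which is what lets experience from past episodes shape behavior in new ones.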
