📄 English Summary
A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning
A regularized Actor-Critic algorithm is proposed for a structured bi-level optimization problem in which the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov Decision Process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and reinforcement learning often require second-order information and impose stringent regularity conditions on the problem structure, which can lead to computational and convergence difficulties in practice. The proposed algorithm addresses the bi-level optimization problem by introducing a regularization term, thereby avoiding any reliance on second-order information. The regularization term is designed to stabilize the training process and accelerate convergence. The core of the algorithm is the decoupling of upper-level optimization from lower-level policy learning: the upper-level variables and the lower-level policy are updated iteratively. The lower-level policy optimization uses an Actor-Critic framework, with the Actor updating the policy and the Critic estimating the value function. The regularization term is crafted to encourage smoothness or sparsity in the changes of the upper-level variables, preventing drastic parameter fluctuations and improving robustness. Theoretical analysis shows that, under specific assumptions, the method converges to a local optimum of the bi-level problem. Experimental results show that the regularized Actor-Critic algorithm achieves superior performance across a range of bi-level reinforcement learning tasks, particularly those with complex reward structures and high-dimensional state spaces, where its convergence speed and stability significantly outperform existing baselines. The algorithm's computational efficiency is also improved, making it better suited to large-scale practical applications.
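The bi-level loop described above can be sketched in code. The following is a minimal illustrative sketch, not the paper's algorithm: the two-state MDP, the upper objective, and the zeroth-order (finite-difference) estimate of the upper-level gradient are all stand-in assumptions, since the summary does not specify the paper's hypergradient estimator. What the sketch does preserve is the structure: the upper-level variable `theta` parameterizes the lower-level reward, the lower level runs a one-step Actor-Critic on the induced MDP, and the upper-level step is damped by a proximal regularization term penalizing large changes in `theta`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP; the upper variable theta parameterizes
# the reward as r(s, a) = theta[s, a].  P[s, a] is the next-state distribution.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
gamma = 0.9

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def actor_critic(theta, iters=2000, alpha=0.1, beta=0.1):
    """Lower level: learn a softmax policy for rewards theta via one-step Actor-Critic."""
    logits = np.zeros((2, 2))  # actor parameters
    V = np.zeros(2)            # critic's value estimates
    s = 0
    for _ in range(iters):
        pi = softmax(logits[s])
        a = rng.choice(2, p=pi)
        s2 = rng.choice(2, p=P[s, a])
        delta = theta[s, a] + gamma * V[s2] - V[s]  # TD error (critic signal)
        V[s] += beta * delta                        # critic update
        grad = -pi
        grad[a] += 1.0                              # grad of log pi(a|s) w.r.t. logits
        logits[s] += alpha * delta * grad           # actor update
        s = s2
    return softmax(logits)

def upper_objective(pi):
    # Hypothetical smooth upper objective: the induced policy should prefer action 0.
    return -np.sum((pi[:, 0] - 1.0) ** 2)

theta = np.zeros((2, 2))
lam = 1.0   # weight of the proximal regularizer on upper-variable change
eps = 0.5   # finite-difference perturbation radius
eta = 0.5   # upper-level step size

for _ in range(20):
    # Zeroth-order estimate of the upper-level gradient (an illustrative
    # stand-in for the paper's estimator; no second-order information used).
    u = rng.standard_normal(theta.shape)
    f_plus = upper_objective(actor_critic(theta + eps * u))
    f_minus = upper_objective(actor_critic(theta - eps * u))
    g = (f_plus - f_minus) / (2 * eps) * u
    # Regularized step: solving the proximal subproblem
    #   max_d  <g, d> - (lam/2)||d||^2 - (1/(2*eta))||d||^2
    # gives the damped update below, which discourages drastic changes in theta.
    theta = theta + eta * g / (1.0 + eta * lam)

final_pi = actor_critic(theta)
```

The proximal factor `1 / (1 + eta * lam)` is where the "smoothness of upper-level changes" enters: with `lam = 0` the loop reduces to plain gradient ascent on the upper objective, while larger `lam` shrinks each step and trades convergence speed for stability.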