📄 Chinese Abstract
Reinforcement learning over combinatorial action spaces is highly challenging because the sets of feasible actions are exponentially large and the underlying constraints are complex, which makes direct policy parameterization impractical. Existing methods typically embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, but this sacrifices generality and policy expressiveness. To address these limitations, a solver-induced Latent Spherical Flow Policy is proposed to handle combinatorial action spaces more effectively. The core idea is to map the high-dimensional, discrete combinatorial action space onto a low-dimensional, continuous latent spherical manifold and to learn the policy on that manifold.
📄 English Summary
Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions
Reinforcement learning (RL) with combinatorial action spaces remains a significant challenge: the sets of feasible actions are exponentially large and governed by intricate feasibility constraints, which renders direct policy parameterization impractical. Current approaches typically embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, but both choices compromise generality and policy expressiveness.

To address these limitations, a solver-induced Latent Spherical Flow Policy is proposed for handling combinatorial action spaces more effectively. The core idea is to map the high-dimensional, discrete combinatorial action space onto a lower-dimensional, continuous latent spherical manifold and to learn the policy on that manifold. By exploiting the geometry of the manifold together with the expressive power of flow models, the policy can generate diverse, constraint-compliant actions without relying on explicit constrained optimization at decision time.

Concretely, the Latent Spherical Flow Policy first encodes the features of combinatorial actions into a latent space via an encoder. A normalizing flow defined on the spherical manifold then models the conditional distribution of actions. Finally, a decoder maps latent samples back to the original combinatorial action space. The resulting policy generates actions that satisfy complex combinatorial constraints while retaining stochasticity and diversity, striking a better balance between exploration and exploitation.
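The encode, flow, decode pipeline described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual architecture: the dimensions, the decoder weights `W`, and the cardinality constraint are all hypothetical, and a simple rotation stands in for a learned spherical normalizing flow (rotations are diffeomorphisms of the sphere, so the transformed sample stays on the manifold). The decoder maps a latent point to a top-K subset, so the generated action satisfies the constraint |a| = K by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

D_LATENT = 8   # dimension of the latent sphere (hypothetical)
N_ITEMS = 20   # size of the ground set of items (hypothetical)
K = 5          # cardinality constraint: actions select exactly K items

def sample_on_sphere(dim, rng):
    """Sample uniformly on the unit sphere S^{dim-1} by normalizing a Gaussian."""
    z = rng.standard_normal(dim)
    return z / np.linalg.norm(z)

def spherical_flow(z, theta):
    """Stand-in for a spherical normalizing flow: rotate in the first two
    coordinates. A rotation maps the sphere to itself, so the output is
    still a valid point on the latent manifold."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(len(z))
    R[0, 0], R[0, 1], R[1, 0], R[1, 1] = c, -s, s, c
    return R @ z

def decode(z, W):
    """Decode a latent point to a combinatorial action: score every item
    and keep the K highest-scoring ones, guaranteeing |action| = K."""
    scores = W @ z
    return np.sort(np.argsort(scores)[-K:])

W = rng.standard_normal((N_ITEMS, D_LATENT))  # hypothetical decoder weights
z = sample_on_sphere(D_LATENT, rng)           # encoder output / base sample
z = spherical_flow(z, theta=0.3)              # flow transform on the sphere
action = decode(z, W)                         # constraint-satisfying action

assert np.isclose(np.linalg.norm(z), 1.0)  # sample stayed on the manifold
assert len(action) == K                    # cardinality constraint holds
```

In a full implementation the rotation would be replaced by an expressive, invertible flow on the sphere conditioned on the state, and the decoder would be trained jointly with the encoder so that decoded actions respect the task's feasibility constraints.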