Near-Optimal Sample Complexity for Online Constrained Markov Decision Processes

📄 Abstract (translated from Chinese)

Safety is a fundamental challenge in reinforcement learning, particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often incur significant safety violations or require high sample complexity to produce near-optimal policies. The work studies two settings: relaxed feasibility, which permits small violations, and strict feasibility, which permits none. A model-based primal-dual algorithm is proposed that bounds both regret and constraint violation, drawing on techniques from online reinforcement learning and constrained optimization.

📄 English Summary

Near-Optimal Sample Complexity for Online Constrained MDPs

Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this challenge, Constrained Markov Decision Processes (CMDPs) are employed to enforce safety constraints while optimizing performance. However, existing methods often lead to significant safety violations or require high sample complexity to generate near-optimal policies. This research addresses two scenarios: relaxed feasibility, where small violations are permissible, and strict feasibility, where no violations are allowed. A model-based primal-dual algorithm is proposed that achieves bounds on both regret and constraint violation, drawing on techniques from online RL and constrained optimization.
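To illustrate the primal-dual idea behind such algorithms, here is a minimal sketch on a toy one-step decision problem. The numbers, the softmax policy, and the step sizes are all illustrative assumptions of this sketch, not details from the paper; the paper's algorithm is model-based, operates on full MDPs, and comes with sample-complexity guarantees that this toy loop does not capture. The sketch only shows the core mechanism: the primal step optimizes the Lagrangian reward r - λ·c, and the dual step raises the multiplier λ when the expected cost exceeds the budget and lowers it otherwise.

```python
import math

# Toy one-step constrained problem (illustrative numbers):
# pick a distribution over two actions to maximize expected reward
# subject to an expected-cost budget b.
rewards = [1.0, 0.2]   # action 0 is lucrative but costly
costs = [1.0, 0.0]
b = 0.5                # cost budget (constraint threshold)

lam = 0.0              # dual variable (Lagrange multiplier)
eta = 0.05             # dual step size

for _ in range(2000):
    # Primal step: (soft) best response to the Lagrangian reward r - lam * c.
    # A softmax with temperature 10 keeps the iterates smooth -- an
    # assumption of this sketch, not the paper's primal update.
    lagr = [r - lam * c for r, c in zip(rewards, costs)]
    m = max(lagr)
    w = [math.exp(10 * (v - m)) for v in lagr]
    z = sum(w)
    pi = [x / z for x in w]

    # Dual step: projected subgradient ascent on lam; increase lam when
    # the constraint is violated (expected cost above b), decrease otherwise.
    exp_cost = sum(p * c for p, c in zip(pi, costs))
    lam = max(0.0, lam + eta * (exp_cost - b))

exp_reward = sum(p * r for p, r in zip(pi, rewards))
```

At the saddle point the policy mixes the two actions so that the cost constraint is met exactly, which is the hallmark of CMDP solutions: optimal constrained policies are generally stochastic even when unconstrained optima are deterministic.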

