AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

📄 English Summary

AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is limited by three challenges: restricted access to proprietary data, the risk of executing unsafe actions in permission-governed environments, and the inability of closed systems to improve from their failures. AOI (Autonomous Operations Intelligence) is a trainable multi-agent framework that formulates automated operations as a structured trajectory learning problem under security constraints. The approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source modules.
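The summary names GRPO but does not say how AOI scores trajectories, so the following is only a minimal sketch of the group-relative advantage step that gives GRPO its name: each sampled trajectory's reward is normalized against the mean and standard deviation of its own sampling group, so no separate value model is needed. The function name, the reward values, and the choice of population standard deviation are assumptions for illustration, not details taken from AOI.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group it was sampled with.

    rewards: scalar rewards for several trajectories sampled from the
             same prompt (hypothetical values; AOI's reward design is
             not described in the summary).
    eps:     small constant that avoids division by zero when every
             trajectory in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four diagnosis attempts: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this normalization, failed attempts in a mixed group receive negative advantages and successful ones positive advantages, while a group whose attempts all score the same contributes zero advantage for every trajectory.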
