Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space

📄 Abstract (translated from Chinese)

A new alignment algorithm, Group Orthogonalized Policy Optimization (GOPO), is proposed, derived from the geometric properties of Hilbert function spaces. Rather than optimizing on the probability simplex, GOPO lifts alignment into the Hilbert space L2(pi_k) of functions square-integrable with respect to the reference policy. Within this space, the simplex constraint reduces to the linear orthogonality condition <v, 1> = 0, which defines a codimension-one subspace H0. Minimizing the distance to the unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem.

📄 English Summary

Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space

A new alignment algorithm called Group Orthogonalized Policy Optimization (GOPO) is presented, derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of the Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of functions square-integrable with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition <v, 1> = 0, defining a codimension-one subspace H0. Minimizing the distance to an unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem.
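The geometry in the summary can be checked numerically on a finite sample space. Below is a minimal sketch (not the paper's implementation) where the L2(pi_k) inner product is <f, h> = E_pi[f h], the constrained maximizer of J(v) = <g, v> - (mu/2)||v||^2 over H0 = {v : <v, 1> = 0} is obtained by orthogonally projecting the unconstrained target u_star = g/mu onto H0, i.e. subtracting its component along the constant function 1 (note ||1||^2 = 1 under pi). The policy `pi`, direction `g`, and strength `mu` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
pi = rng.dirichlet(np.ones(n))   # reference policy pi_k (assumed, sums to 1)
g = rng.normal(size=n)           # gradient / advantage direction g (assumed given)
mu = 2.0                         # dissipation strength mu > 0 (assumed)

def inner(f, h):
    """L2(pi) inner product <f, h> = E_pi[f h]."""
    return float(np.sum(pi * f * h))

def J(v):
    """Work-dissipation functional J(v) = <g, v> - (mu/2) ||v||^2."""
    return inner(g, v) - 0.5 * mu * inner(v, v)

# Unconstrained maximizer of J: u* = g / mu.
u_star = g / mu
# Hilbert projection onto H0: subtract the component along 1 (||1||^2 = <1,1> = 1).
v_star = u_star - inner(u_star, np.ones(n))

# v* satisfies the orthogonality constraint <v, 1> = 0 ...
assert abs(inner(v_star, np.ones(n))) < 1e-12
# ... and no perturbation inside H0 can improve J (concavity + projection theorem).
w = rng.normal(size=n)
w = w - inner(w, np.ones(n))     # center w so that v* + t*w stays in H0
assert J(v_star) >= J(v_star + 0.1 * w)
```

Because the gradient of J at v_star, namely g - mu * v_star = E_pi[g] * 1, is proportional to the constant function, it is orthogonal to every direction in H0, which is exactly the first-order condition the projection theorem delivers.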


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.