Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

📄 Chinese Summary

Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: optimistic methods such as Best-of-$N$ are prone to reward hacking, while pessimistic regularized methods often suppress the exploration needed to discover high-quality responses. Through the lens of regret minimization, the work formalizes this trade-off, showing that the optimal strategy depends critically on the tail behavior of the reward distribution. Theoretical analysis indicates that light-tailed distributions favor optimism to uncover high-quality outliers, whereas heavy-tailed distributions call for a more cautious approach.
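As a rough illustration of the regret-minimization framing (a generic sketch, not necessarily the paper's exact definition): the selector only observes a proxy reward $\hat{r}$ from the imperfect reward model, while regret is measured under the true reward $r^{*}$,

$$
\mathrm{Regret}(\pi) \;=\; \mathbb{E}_{y \sim \pi^{*}}\big[r^{*}(y)\big] \;-\; \mathbb{E}_{y \sim \pi}\big[r^{*}(y)\big],
$$

where $\pi$ is any selection rule over $N$ candidates drawn from the reference model $\pi_{\mathrm{ref}}$ and $\pi^{*}$ is the best such rule under the true reward. Optimistic and pessimistic strategies differ in how aggressively $\pi$ chases high values of $\hat{r}$.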

📄 English Summary

Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Inference-time alignment effectively guides large language models (LLMs) by generating multiple candidates from a reference model and selecting among them using an imperfect reward model. However, current strategies encounter a fundamental dilemma: 'optimistic' approaches like Best-of-$N$ are prone to reward hacking, while 'pessimistic' regularized methods often stifle the exploration necessary to uncover high-quality responses. The paper formalizes this trade-off through the lens of regret minimization, demonstrating that the optimal strategy critically depends on the tail behavior of the reward distribution. Theoretical analysis reveals that light-tailed regimes favor optimism to discover high-quality outliers, whereas heavy-tailed regimes necessitate a more cautious approach.
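For concreteness, here is a minimal Python sketch of the two selection strategies the summary contrasts. `generate` and `proxy_reward` are hypothetical stand-ins (assumptions for illustration, not from the paper): in practice the former would sample from the reference LLM and the latter would query the imperfect reward model. `soft_best_of_n` is a generic KL-regularized re-weighting, not necessarily the method the paper proposes.

```python
import math
import random

# Hypothetical stand-ins for illustration only (not the paper's code).
def generate(prompt: str, n: int) -> list[str]:
    """Sample n candidate responses; a real system would call the reference LLM."""
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def proxy_reward(response: str) -> float:
    """Score a response; a real system would call the learned (imperfect) reward model."""
    return (hash(response) % 1000) / 1000.0

def best_of_n(prompt: str, n: int) -> str:
    """Optimistic selection: keep the candidate the proxy reward ranks highest.

    Finds high-quality outliers in light-tailed regimes, but is prone to
    reward hacking when the proxy over-scores rare bad responses.
    """
    return max(generate(prompt, n), key=proxy_reward)

def soft_best_of_n(prompt: str, n: int, beta: float = 1.0) -> str:
    """Pessimistic (KL-regularized) selection: re-weight candidates drawn from
    the reference model by exp(proxy_reward / beta) and sample one.

    beta -> 0 recovers Best-of-N; large beta stays close to the reference
    model, trading exploration for robustness to reward-model error.
    """
    candidates = generate(prompt, n)
    scores = [proxy_reward(c) / beta for c in candidates]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(best_of_n("Explain inference-time alignment", n=8))
    print(soft_best_of_n("Explain inference-time alignment", n=8, beta=0.5))
```

The single temperature-like parameter `beta` makes the optimism-pessimism trade-off explicit: the tail behavior of the reward distribution determines whether the argmax extreme or a heavily regularized choice incurs lower regret.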
