Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

📄 Summary

Large language models (LLMs) are typically governed by post-training alignment methods such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), which yield a largely static policy at deployment and inference time. Real-world safety, however, is a full-lifecycle challenge: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. Inference-time governance therefore becomes necessary to steer model behavior without costly retraining. To this end, the Consensus Clustering LinUCB Bandit (CCLUB) is introduced as a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus-clustering mechanism that pools data only within the intersection of the utility and safety similarity graphs, enhancing both the adaptability and the safety of the model.
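The summary does not spell out the algorithm, but the two named ingredients (LinUCB arms over system prompts, and conservative pooling restricted to the edge intersection of a utility graph and a safety graph) can be illustrated with a minimal sketch. All class and method names below (`LinUCBArm`, `CCLUBRouter`, `consensus_cluster`) are hypothetical, and the fully-connected initial graphs are an assumed simplification; this is an illustration of the general technique, not the paper's implementation.

```python
import numpy as np


class LinUCBArm:
    """One candidate system prompt, modeled as a LinUCB bandit arm."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)  # ridge-regularized Gram matrix
        self.b = np.zeros(dim)      # accumulated reward-weighted contexts

    def ucb(self, x, alpha):
        """Upper confidence bound for context (query embedding) x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                      # ridge estimate
        return theta @ x + alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


class CCLUBRouter:
    """Sketch of consensus-clustered LinUCB routing (assumed interface).

    Arm statistics are pooled only where the utility and safety
    similarity graphs BOTH have an edge, so sharing is conservative:
    a neighbor similar in utility but not in safety is excluded.
    """

    def __init__(self, prompts, dim, alpha=1.0):
        self.prompts = prompts
        self.dim = dim
        self.alpha = alpha
        self.arms = [LinUCBArm(dim) for _ in prompts]
        n = len(prompts)
        # Start fully connected; a full implementation would prune edges
        # as evidence of utility/safety dissimilarity accumulates.
        self.util_graph = np.ones((n, n), bool)
        self.safe_graph = np.ones((n, n), bool)

    def consensus_cluster(self, i):
        """Arms pooled with arm i: neighbors in BOTH similarity graphs."""
        return np.flatnonzero(self.util_graph[i] & self.safe_graph[i])

    def select(self, x):
        """Route query features x to the prompt with the highest UCB,
        scoring each arm on statistics pooled over its consensus cluster."""
        scores = []
        for i in range(len(self.arms)):
            pooled = LinUCBArm(self.dim)
            for j in self.consensus_cluster(i):
                # Sum sufficient statistics, keeping one shared prior.
                pooled.A += self.arms[j].A - np.eye(self.dim)
                pooled.b += self.arms[j].b
            scores.append(pooled.ucb(x, self.alpha))
        return int(np.argmax(scores))

    def update(self, i, x, reward):
        self.arms[i].update(x, reward)


# Usage: route a query embedding, then feed back an observed reward.
router = CCLUBRouter(["strict-safety", "balanced", "creative"], dim=4)
x = np.array([0.2, 0.5, 0.1, 0.9])  # placeholder query embedding
arm = router.select(x)              # index of the chosen system prompt
router.update(arm, x, reward=1.0)
```

In a deployed system the reward would combine a utility signal (task success) and a safety signal (e.g. a jailbreak/refusal judge), which is what motivates maintaining two separate similarity graphs rather than one.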

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others