Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

📄 Summary

Large language models (LLMs) are typically governed by post-training alignment methods such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), which yield a largely static policy at deployment and inference time. Real-world safety, however, is a full-lifecycle challenge: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. Inference-time governance therefore becomes necessary to steer model behavior without costly retraining. To this end, the Consensus Clustering LinUCB Bandit (CCLUB) is introduced as a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus-clustering mechanism that pools data only within the intersection of the utility and safety similarity graphs, enhancing both the adaptability and the safety of the model.
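The summary does not spell out the algorithm, but the two named ingredients (LinUCB arms over system prompts, and conservative pooling restricted to the edge intersection of a utility graph and a safety graph) can be illustrated with a minimal sketch. All class and method names below (`LinUCBArm`, `CCLUBRouter`, `consensus_cluster`) are hypothetical, and the fully-connected initial graphs are an assumed simplification; this is an illustration of the general technique, not the paper's implementation.

```python
import numpy as np


class LinUCBArm:
    """One candidate system prompt, modeled as a LinUCB bandit arm."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)  # ridge-regularized Gram matrix
        self.b = np.zeros(dim)      # accumulated reward-weighted contexts

    def ucb(self, x, alpha):
        """Upper confidence bound for context (query embedding) x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                      # ridge estimate
        return theta @ x + alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


class CCLUBRouter:
    """Sketch of consensus-clustered LinUCB routing (assumed interface).

    Arm statistics are pooled only where the utility and safety
    similarity graphs BOTH have an edge, so sharing is conservative:
    a neighbor similar in utility but not in safety is excluded.
    """

    def __init__(self, prompts, dim, alpha=1.0):
        self.prompts = prompts
        self.dim = dim
        self.alpha = alpha
        self.arms = [LinUCBArm(dim) for _ in prompts]
        n = len(prompts)
        # Start fully connected; a full implementation would prune edges
        # as evidence of utility/safety dissimilarity accumulates.
        self.util_graph = np.ones((n, n), bool)
        self.safe_graph = np.ones((n, n), bool)

    def consensus_cluster(self, i):
        """Arms pooled with arm i: neighbors in BOTH similarity graphs."""
        return np.flatnonzero(self.util_graph[i] & self.safe_graph[i])

    def select(self, x):
        """Route query features x to the prompt with the highest UCB,
        scoring each arm on statistics pooled over its consensus cluster."""
        scores = []
        for i in range(len(self.arms)):
            pooled = LinUCBArm(self.dim)
            for j in self.consensus_cluster(i):
                # Sum sufficient statistics, keeping one shared prior.
                pooled.A += self.arms[j].A - np.eye(self.dim)
                pooled.b += self.arms[j].b
            scores.append(pooled.ucb(x, self.alpha))
        return int(np.argmax(scores))

    def update(self, i, x, reward):
        self.arms[i].update(x, reward)


# Usage: route a query embedding, then feed back an observed reward.
router = CCLUBRouter(["strict-safety", "balanced", "creative"], dim=4)
x = np.array([0.2, 0.5, 0.1, 0.9])  # placeholder query embedding
arm = router.select(x)              # index of the chosen system prompt
router.update(arm, x, reward=1.0)
```

In a deployed system the reward would combine a utility signal (task success) and a safety signal (e.g. a jailbreak/refusal judge), which is what motivates maintaining two separate similarity graphs rather than one.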

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others