Explainable LLM Unlearning Through Reasoning

Source: Explainable LLM Unlearning Through Reasoning

Published: March 12, 2026

📄 English Summary

Explainable LLM Unlearning Through Reasoning

Unlearning in large language models (LLMs) is crucial for mitigating safety, copyright, and privacy issues in pre-trained models. Unlike preference alignment, unlearning offers a more explicit approach: it removes undesirable knowledge characterized by a specific unlearning dataset. Prior work using gradient ascent (GA) and its variants has shown promise for unlearning; however, its untargeted nature leads to unintended degradation of general capabilities, incomplete knowledge removal, and incoherent generated responses. These issues stem from a lack of explicit guidance on what and how models should unlearn. To address this gap, a novel reasoning-based unlearning objective is introduced, aiming to provide clearer unlearning directions and thereby improve model performance and coherence.
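To make the GA baseline mentioned above concrete, here is a minimal sketch of gradient-ascent unlearning. It is not the paper's method: a toy softmax classifier stands in for an LLM's next-token head, and all names (`forget_set`, the learning rate, the toy dimensions) are illustrative assumptions. The core idea is only the sign flip: instead of descending the negative log-likelihood on the forget set, GA ascends it, driving the model away from the knowledge to be removed.

```python
import numpy as np

# Toy stand-in for an LLM next-token head: a linear softmax model.
# Illustrative only -- not the architecture or data from the paper.
rng = np.random.default_rng(0)
VOCAB, DIM = 8, 4
W = rng.normal(scale=0.1, size=(DIM, VOCAB))  # "model" parameters

def nll(W, x, y):
    """Negative log-likelihood of token y given features x."""
    logits = x @ W
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])

def grad_nll(W, x, y):
    """Gradient of the NLL w.r.t. W (softmax: probs - one_hot(y))."""
    logits = x @ W
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    probs[y] -= 1.0                            # d(NLL)/d(logits)
    return np.outer(x, probs)

# "Forget set": examples whose knowledge should be removed (synthetic here).
forget_set = [(rng.normal(size=DIM), int(rng.integers(VOCAB)))
              for _ in range(4)]

before = np.mean([nll(W, x, y) for x, y in forget_set])
for _ in range(50):
    for x, y in forget_set:
        W += 0.05 * grad_nll(W, x, y)          # gradient ASCENT: +, not -
after = np.mean([nll(W, x, y) for x, y in forget_set])

print(after > before)  # loss on the forget set rises
```

Note how the update is completely untargeted: it pushes the loss up on forget examples with no constraint on what the model should say instead, which is exactly the failure mode (capability degradation, incoherent outputs) the summary attributes to GA.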

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.