H-Node Attack and Defense in Large Language Models

📄 English Summary

H-Node Attack and Defense in Large Language Models

The study presents H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework designed to identify, exploit, and defend against hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes the hallucination signal to a small set of high-variance dimensions, termed Hallucination Nodes (H-Nodes), achieving a probe AUC of 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. An adaptive ANC defense mechanism suppresses excess H-Node presence using confidence measures.
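The probe, attack, and defense steps described above can be illustrated with a small NumPy sketch on synthetic data. This is not the study's implementation: the function names, the gradient-descent probe, the rule of ranking dimensions by absolute probe weight, the scaling factor `alpha`, and the baseline-plus-confidence suppression rule are all assumptions made for illustration. In a real model the amplification would run inside a forward hook (for example PyTorch's `register_forward_hook`) rather than on a stored array.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic regression probe on standardized hidden states
    with plain gradient descent (an illustrative stand-in for any solver)."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))  # predicted hallucination probability
        g = p - y                                # gradient of the log loss w.r.t. the logits
        w -= lr * (Xs.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def select_h_nodes(w, k):
    """Assumed selection rule: the k dimensions with the largest |probe weight|."""
    return np.argsort(-np.abs(w))[:k]

def amplify(h, nodes, alpha):
    """Attack sketch: scale only the selected dimensions by a hypothetical factor alpha."""
    out = h.copy()
    out[..., nodes] *= alpha
    return out

def suppress(h, nodes, baseline, confidence):
    """Defense sketch: pull the selected dimensions back toward a clean baseline.
    confidence in [0, 1] sets how much of the excess is removed (assumed rule)."""
    out = h.copy()
    excess = out[..., nodes] - baseline[nodes]
    out[..., nodes] -= confidence * excess
    return out

# Synthetic "last-token hidden states": dimensions 3 and 17 carry the label signal.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 400)               # 1 = hallucinated, 0 = grounded
hidden = rng.normal(size=(400, 32))
hidden[:, [3, 17]] += 2.0 * labels[:, None]

weights, bias = train_probe(hidden, labels)
h_nodes = select_h_nodes(weights, k=2)
baseline = hidden[labels == 0].mean(axis=0)    # mean activation on grounded samples

attacked = amplify(hidden[0], h_nodes, alpha=2.0)
defended = suppress(attacked, h_nodes, baseline, confidence=0.8)
```

Ranking by absolute probe weight on standardized inputs is only one plausible way to localize the signal; the summary describes the H-Nodes as high-variance dimensions, so a variance-based criterion could be substituted in `select_h_nodes` without changing the rest of the sketch.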
