📄 Summary

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

As defenses against structured personally identifiable information (PII) mature, Large Language Models (LLMs) have introduced a new threat: Semantic Sensitive Information (SemSI). SemSI covers inferred sensitive identity attributes, reputation-harming content, and hallucinated, potentially incorrect information. Whether LLMs can self-regulate these complex, context-dependent leaks without compromising their utility remains an open question. To address it, SemSIEdit is proposed: an inference-time framework in which an agentic 'Editor' iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to respond. The analysis reveals a privacy-utility Pareto frontier and shows that this agentic rewriting approach reduces leakage by 34.6%.
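The core loop the summary describes (critique a draft, rewrite the flagged spans, repeat until nothing is flagged) can be sketched as below. All names here are hypothetical, and the rule-based critic and rewriter are toy stand-ins for the paper's LLM Editor agent, not its actual implementation:

```python
import re

# Toy stand-in for LLM-based SemSI detection: pattern -> neutral paraphrase.
# These patterns and rewrites are illustrative assumptions only.
SENSITIVE_PATTERNS = {
    r"\blives at \d+ \w+ Street\b": "lives in the area",
    r"\bwas diagnosed with \w+\b": "has a health condition",
}

def critique(text: str) -> list[str]:
    """Return the sensitive spans the 'critic' flags in the text."""
    return [m.group(0)
            for pattern in SENSITIVE_PATTERNS
            for m in re.finditer(pattern, text)]

def rewrite(text: str) -> str:
    """Paraphrase each flagged span in place, keeping the sentence intact."""
    for pattern, neutral in SENSITIVE_PATTERNS.items():
        text = re.sub(pattern, neutral, text)
    return text

def semsi_edit(text: str, max_rounds: int = 3) -> str:
    """Iterate critique -> rewrite until no spans are flagged or rounds run out."""
    for _ in range(max_rounds):
        if not critique(text):
            break
        text = rewrite(text)
    return text

story = "Alice lives at 12 Oak Street and was diagnosed with asthma."
print(semsi_edit(story))
```

Swapping `critique` and `rewrite` for LLM calls recovers the refusal-free behavior the summary describes: sensitive content is paraphrased in place and the narrative survives, rather than the whole answer being withheld.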

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.