📄 Summary

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

As defenses against structured personally identifiable information (PII) mature, Large Language Models (LLMs) have introduced a new threat: Semantic Sensitive Information (SemSI). SemSI covers inferred sensitive identity attributes, reputation-harming content, and hallucinated, potentially incorrect information. Whether LLMs can self-regulate these complex, context-dependent leaks without compromising their utility remains an open question. To address it, SemSIEdit is proposed: an inference-time framework in which an agentic 'Editor' iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to respond. The analysis reveals a privacy-utility Pareto frontier and shows that this agentic rewriting approach reduces leakage by 34.6%.
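The core loop the summary describes (critique a draft, rewrite the flagged spans, repeat until nothing is flagged) can be sketched as below. All names here are hypothetical, and the rule-based critic and rewriter are toy stand-ins for the paper's LLM Editor agent, not its actual implementation:

```python
import re

# Toy stand-in for LLM-based SemSI detection: pattern -> neutral paraphrase.
# These patterns and rewrites are illustrative assumptions only.
SENSITIVE_PATTERNS = {
    r"\blives at \d+ \w+ Street\b": "lives in the area",
    r"\bwas diagnosed with \w+\b": "has a health condition",
}

def critique(text: str) -> list[str]:
    """Return the sensitive spans the 'critic' flags in the text."""
    return [m.group(0)
            for pattern in SENSITIVE_PATTERNS
            for m in re.finditer(pattern, text)]

def rewrite(text: str) -> str:
    """Paraphrase each flagged span in place, keeping the sentence intact."""
    for pattern, neutral in SENSITIVE_PATTERNS.items():
        text = re.sub(pattern, neutral, text)
    return text

def semsi_edit(text: str, max_rounds: int = 3) -> str:
    """Iterate critique -> rewrite until no spans are flagged or rounds run out."""
    for _ in range(max_rounds):
        if not critique(text):
            break
        text = rewrite(text)
    return text

story = "Alice lives at 12 Oak Street and was diagnosed with asthma."
print(semsi_edit(story))
```

Swapping `critique` and `rewrite` for LLM calls recovers the refusal-free behavior the summary describes: sensitive content is paraphrased in place and the narrative survives, rather than the whole answer being withheld.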

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.