Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

📄 Summary

Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To close this gap, this study presents a systematic theoretical and empirical characterization of memorization in DLMs. It proposes a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 then establishes a monotonicity property of the sampling process, further revealing characteristics of memorization that are distinctive to DLMs.
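To make the setup concrete: read in the spirit of standard extractable-memorization definitions, the unified quantity is presumably of the form

$$
P_{\text{ext}}(x; M) \;=\; \Pr\big[\hat{x} = x \,\big|\, \hat{x}_i = x_i \ \text{for all observed positions } i \in M\big],
$$

where the probability is taken over the model's stochastic sampling trajectories, the mask pattern $M$ fixes the observed positions, and prefix-conditioned decoding is recovered as the special case $M = \{1, \dots, k\}$. The Python sketch below shows what a Monte Carlo estimator of this quantity could look like. It is an illustrative assumption, not the paper's code: `denoise_step` is a toy uniform sampler standing in for a trained DLM, and `steps`, `remask_frac`, and all function names are hypothetical.

```python
"""
Minimal sketch (NOT the paper's implementation) of a generalized
probabilistic extraction test for a masked diffusion language model.
"""
import random

MASK = -1  # sentinel id marking a masked position


def denoise_step(tokens, vocab_size, rng):
    """One reverse-diffusion step: fill every masked position with a
    sample from the model's per-position distribution. This toy model
    samples uniformly; a real DLM would use learned logits."""
    return [rng.randrange(vocab_size) if t == MASK else t for t in tokens]


def sample_trajectory(masked, vocab_size, steps, remask_frac, rng):
    """Run one stochastic unmask/remask trajectory, the DLM analogue of
    a decoding pass. Positions observed in the mask pattern are never
    remasked, so they condition the whole trajectory."""
    observed = {i for i, t in enumerate(masked) if t != MASK}
    tokens = list(masked)
    for _ in range(steps):
        tokens = denoise_step(tokens, vocab_size, rng)
        # stochastically remask a fraction of the generated positions
        tokens = [
            MASK if i not in observed and rng.random() < remask_frac else t
            for i, t in enumerate(tokens)
        ]
    return denoise_step(tokens, vocab_size, rng)  # fill any leftover masks


def extraction_probability(target, mask_pattern, vocab_size=50,
                           n_samples=200, steps=8, remask_frac=0.3, seed=0):
    """Monte Carlo estimate of P[model reproduces `target` verbatim]
    under an arbitrary mask pattern (True = token shown to the model).
    Prefix-conditioned extraction is the special case where exactly
    the suffix is masked."""
    rng = random.Random(seed)
    masked = [t if keep else MASK for t, keep in zip(target, mask_pattern)]
    hits = sum(
        sample_trajectory(masked, vocab_size, steps, remask_frac, rng) == target
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    target = [7, 3, 42, 8, 19, 4]
    prefix_mask = [True, True, True, False, False, False]  # classic prefix attack
    scattered = [True, False, True, False, True, False]    # arbitrary mask pattern
    # With the toy uniform model both estimates are near zero; a trained
    # DLM that memorized `target` would score markedly higher.
    print(extraction_probability(target, prefix_mask))
    print(extraction_probability(target, scattered))
```

Note that scattered mask patterns probe infilling-style memorization with no direct ARM analogue, which is presumably why extraction in DLMs calls for this generalized framing rather than prefix-only attacks.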
