A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

📄 English Summary

A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

Symbolic sequences such as written language and genomic DNA exhibit characteristic frequency distributions and long-range correlations that extend across many symbols. In language, this appears as Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens. In DNA, it is reflected in the nucleotide composition and in long-memory walks under purine-pyrimidine mappings. Existing surrogate models typically preserve either the frequency distribution or the correlation properties, but not both at once. A new surrogate model is introduced that satisfies both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure.
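
To make the "one constraint but not the other" limitation concrete, here is a minimal sketch of the simplest frequency-preserving baseline, a random shuffle. It is not the surrogate model introduced in the paper, whose construction is not described in this summary. The function names `shuffle_surrogate` and `indicator_autocorr` and the toy block sequence are assumptions made for illustration only.

```python
import random
from collections import Counter


def shuffle_surrogate(seq, seed=0):
    """Return a randomly permuted copy of seq.

    Illustrative baseline only: symbol counts are kept exactly,
    but any ordering (and hence correlation) is destroyed.
    """
    rng = random.Random(seed)
    out = list(seq)
    rng.shuffle(out)
    return out


def indicator_autocorr(seq, symbol, lag):
    """Normalized autocorrelation at the given lag of the 0/1 indicator of symbol."""
    x = [1.0 if s == symbol else 0.0 for s in seq]
    n = len(x)
    if lag >= n:
        return 0.0
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return cov / var


# Hypothetical toy sequence: long runs of "a" and "b" stand in for correlated text.
original = list(("a" * 50 + "b" * 50) * 10)
surrogate = shuffle_surrogate(original)

# Symbol frequencies are identical by construction.
same_counts = Counter(original) == Counter(surrogate)

# Correlation at lag 10 is strong in the original and close to zero after shuffling.
corr_original = indicator_autocorr(original, "a", 10)
corr_surrogate = indicator_autocorr(surrogate, "a", 10)
```

In this toy case the shuffled copy keeps the symbol counts exactly while its lag-10 correlation falls to roughly zero, which is the gap a surrogate that preserves both properties is meant to close.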
