📄 English Summary
Graph Tokenization for Bridging Graphs and Transformers
The success of large pretrained Transformers is closely tied to tokenizers that convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. This research introduces a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate the effectiveness of this approach in processing graph data.
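The pipeline described above can be sketched in miniature. The snippet below is a toy illustration, not the paper's actual method: it assumes a hypothetical bracket-based DFS serialization of a small molecule-like graph (branches opened and closed with `(` and `)`, which keeps the traversal reversible), then applies plain BPE, repeatedly merging the most frequent adjacent token pair so that recurring substructures collapse into single tokens.

```python
from collections import Counter

def bpe_merges(seq, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Stops early when no pair occurs at least twice. Returns the
    compressed sequence and the list of learned merge rules.
    """
    seq = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            # Greedy left-to-right application of the merge rule.
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

# Hypothetical serialization of a small branched graph: node labels in
# DFS order, with '(' / ')' marking a branch (assumed, for illustration).
tokens = ["C", "C", "O", "(", "C", "C", ")", "C", "C", "O"]
compressed, merges = bpe_merges(tokens, num_merges=3)
# The frequent "C-C" edge and the "C-C-O" chain become single tokens.
```

Because the serialization is guided to emit frequent substructures often, BPE's frequency-driven merges naturally promote those substructures to dedicated vocabulary entries, which is the core intuition of the framework.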