📄 English Summary
LumberChunker: Long-Form Narrative Document Segmentation
LumberChunker is a technique that uses large language models (LLMs) to decide where to segment long-form narratives, producing more natural text chunks that help retrieval-augmented generation (RAG) systems retrieve the right information. Long-form narrative documents typically have an explicit structure, such as chapters or sections, but these units are often too broad for retrieval. At a finer granularity, significant semantic shifts occur inside these larger units without any visible structural break. When text is split solely on formatting cues, such as paragraph boundaries or fixed token windows, passages that belong to the same narrative unit may be separated while unrelated content is grouped together. This mismatch between structure and meaning yields chunks that contain incomplete or mixed information.
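The contrast between format-based and meaning-based splitting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LumberChunker's actual implementation: the `detect_shift` callable stands in for the LLM call, and the names `fixed_window_chunks`, `semantic_chunks`, `tag_shift`, and the `max_words` budget are all hypothetical.

```python
from typing import Callable, List

# Hypothetical interface: given a window of consecutive paragraphs, return the
# index (within the window) of the first paragraph where the content shifts,
# or len(window) if no shift is found. In practice an LLM would play this role.
ShiftDetector = Callable[[List[str]], int]


def fixed_window_chunks(paragraphs: List[str], size: int) -> List[List[str]]:
    """Baseline: cut every `size` paragraphs, ignoring meaning."""
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]


def semantic_chunks(
    paragraphs: List[str],
    detect_shift: ShiftDetector,
    max_words: int = 200,  # assumed budget for how much text the detector sees
) -> List[List[str]]:
    """Cut where the detector reports a content shift."""
    chunks: List[List[str]] = []
    start = 0
    while start < len(paragraphs):
        # Grow a window of consecutive paragraphs up to the word budget.
        end, words = start, 0
        while end < len(paragraphs) and words < max_words:
            words += len(paragraphs[end].split())
            end += 1
        window = paragraphs[start:end]
        # Clamp so each chunk has at least one paragraph and stays in the window.
        cut = max(1, min(detect_shift(window), len(window)))
        chunks.append(window[:cut])
        start += cut  # the next window begins at the detected shift
    return chunks


def tag_shift(window: List[str]) -> int:
    """Toy stand-in for the LLM: a shift is a change in the 'scene:' prefix."""
    first_tag = window[0].split(":")[0]
    for i, paragraph in enumerate(window):
        if paragraph.split(":")[0] != first_tag:
            return i
    return len(window)
```

With five toy paragraphs tagged `ship`, `ship`, `ship`, `town`, `town`, a fixed window of two paragraphs puts the last `ship` paragraph and the first `town` paragraph in the same chunk, while `semantic_chunks` with `tag_shift` cuts at the scene change. The window budget still matters: if the window is too small to contain a whole scene, the sketch cuts at the window edge rather than at a shift.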