Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents

📄 English Summary

Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents

Long-document topic segmentation plays a crucial role in information retrieval and document understanding, yet existing methods exhibit significant shortcomings in ultra-long text scenarios. Traditional discriminative models are limited by fixed windows and fail to capture document-level semantics. Although generative large language models can produce paragraph boundaries, they incur high inference costs and struggle with long inputs. To address these challenges, a discriminative segmentation model based on Qwen3-0.6B is proposed. This model incorporates a cross-window context fusion layer and a boundary classification head on top of the backbone network, combined with an overlapping sliding-window strategy. The model supports single-pass inputs of up to 13k characters.
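The overlapping sliding-window strategy described above can be illustrated with a minimal sketch. Note this is an assumption-laden toy driver, not the paper's actual pipeline: the window/stride values, the `score_fn` stub standing in for the Qwen3-0.6B boundary classification head, and the averaging of scores in overlap regions are all illustrative choices.

```python
def make_windows(n_sentences, window=64, stride=48):
    """Cover n_sentences with overlapping (start, end) spans.

    stride < window guarantees each interior sentence is seen by at
    least two windows, giving the fusion step context from both sides.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_sentences)
        spans.append((start, end))
        if end == n_sentences:
            break
        start += stride
    return spans


def merge_boundary_scores(n_sentences, spans, score_fn):
    """Average per-sentence boundary scores across overlapping windows.

    score_fn(start, end) is a stand-in for running the model's boundary
    classification head on one window; averaging in the overlap regions
    is one simple way to reconcile cross-window predictions.
    """
    sums = [0.0] * n_sentences
    counts = [0] * n_sentences
    for start, end in spans:
        scores = score_fn(start, end)
        for i, s in zip(range(start, end), scores):
            sums[i] += s
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]
```

For example, `make_windows(100, window=64, stride=48)` yields the spans `(0, 64)` and `(48, 100)`, so sentences 48-63 receive scores from both windows before merging.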


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.