Beyond Vector Search: Building a Forest of Clauses (FoC) Architecture for Finance

📄 English Summary

Beyond Vector Search: Building a Forest of Clauses (FoC) Architecture for Financial RAG

In highly hierarchical long-document domains such as finance and healthcare, traditional vector search, including hybrid retrieval, suffers structural recall failures on queries that compare logic across chapters. The Forest of Clauses (FoC) architecture addresses this by treating the document's table of contents as a 'first-class citizen.' It runs two retrieval engines concurrently, top-down LLM routing over the tree structure and bottom-up vector search over fragments, and assembles a precise 'subtree' in memory from their results. A custom $O(N)$ stack-based parser builds the clause forest dynamically, even for documents with non-standard hierarchies, which had been an engineering barrier.
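
The summary does not show the parser itself, so the following is only a minimal sketch of how a single-pass, stack-based forest builder could work. It assumes headings have already been extracted as `(level, title)` pairs; the names `ClauseNode` and `build_forest` are hypothetical and not taken from the article. Each node is pushed once and popped at most once, which is what keeps the pass at $O(N)$, and a heading that skips a level simply attaches to the nearest shallower ancestor.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class ClauseNode:
    """One heading or clause in the forest (hypothetical structure)."""
    level: int
    title: str
    children: List["ClauseNode"] = field(default_factory=list)


def build_forest(headings: Iterable[Tuple[int, str]]) -> List[ClauseNode]:
    """Build a forest from (level, title) pairs in document order.

    Assumption: a smaller level number means a shallower heading.
    """
    roots: List[ClauseNode] = []
    stack: List[ClauseNode] = []  # path from a root down to the latest node

    for level, title in headings:
        node = ClauseNode(level, title)
        # Pop until the top of the stack is strictly shallower than this node.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)  # nearest shallower ancestor
        else:
            roots.append(node)  # no ancestor left: start a new tree
        stack.append(node)

    return roots


# Illustrative input with a skipped level (1 -> 3) in the first chapter.
forest = build_forest([
    (1, "Chapter 1"),
    (3, "Clause 1.1.1"),
    (2, "Section 1.2"),
    (1, "Chapter 2"),
    (2, "Section 2.1"),
])
```

In this example the forest has two roots, and `Clause 1.1.1` hangs directly under `Chapter 1` because no level-2 heading precedes it. A real parser for financial documents would also need to infer the level of each heading from its numbering style, which this sketch leaves out.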
