Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

📄 English Summary

Retrieval-augmented code generation often relies on large retrieved code snippets to condition the decoder, tying online inference costs to repository size and introducing noise from long contexts. A two-stage approach called Hierarchical Embedding Fusion (HEF) is proposed for repository representation in code completion. First, an offline cache compresses repository chunks into a reusable hierarchy of dense vectors using a small fuser model. Second, an online interface maps a small number of retrieved vectors into learned pseudo-tokens that are consumed by the code generator. This approach replaces thousands of retrieved tokens with a fixed pseudo-token budget while preserving access to repository-level information. HEF demonstrates strong performance on RepoBench and RepoEval with a 1.8B-parameter model.
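The two stages described above can be sketched in a minimal, self-contained form. This is an illustrative toy, not the paper's implementation: the hash-based `embed` function stands in for the small fuser model, the "hierarchy" is simplified to two levels (per-chunk vectors plus one repository-level mean vector), and the projection in `to_pseudo_tokens` is random where HEF would use learned parameters. All names (`build_offline_cache`, `to_pseudo_tokens`, `NUM_PSEUDO`, etc.) are hypothetical.

```python
import hashlib
import math
import random

DIM = 8          # toy embedding dimension; real systems use hundreds
NUM_PSEUDO = 4   # fixed pseudo-token budget replacing thousands of raw tokens

def embed(text, dim=DIM):
    """Stand-in for the small fuser model: deterministic hash-seeded embedding."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def build_offline_cache(repo_chunks):
    """Stage 1 (offline): compress repository chunks into a reusable vector
    hierarchy -- here just per-chunk vectors plus one repo-level mean vector."""
    chunk_vecs = {path: embed(code) for path, code in repo_chunks.items()}
    repo_vec = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs.values())]
    return {"chunks": chunk_vecs, "repo": repo_vec}

def retrieve(cache, query, k=2):
    """Cosine-similarity retrieval over the cached chunk vectors."""
    q = embed(query)
    scored = sorted(
        cache["chunks"].items(),
        key=lambda kv: -sum(a * b for a, b in zip(q, kv[1])),
    )
    return [vec for _, vec in scored[:k]]

def to_pseudo_tokens(vectors, repo_vec, budget=NUM_PSEUDO):
    """Stage 2 (online): map a few retrieved vectors plus the repo-level
    vector into a fixed budget of pseudo-token embeddings via a projection
    (learned in HEF; random here)."""
    rng = random.Random(42)
    proj = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
    inputs = (vectors + [repo_vec])[:budget]
    while len(inputs) < budget:          # pad to the budget with the repo vector
        inputs.append(repo_vec)
    return [[sum(w * x for w, x in zip(row, v)) for row in proj] for v in inputs]

repo = {
    "utils.py": "def add(a, b): return a + b",
    "io.py": "def read_file(path): ...",
    "math_ops.py": "def mul(a, b): return a * b",
}
cache = build_offline_cache(repo)
pseudo = to_pseudo_tokens(retrieve(cache, "multiply two numbers"), cache["repo"])
print(len(pseudo), len(pseudo[0]))  # → 4 8
```

The point of the sketch is the cost structure: the cache is built once offline, and at generation time the decoder is conditioned on `NUM_PSEUDO` fixed-size embeddings rather than on the full text of the retrieved chunks, so online cost no longer scales with repository size.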


Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others