📄 Chinese Summary
Efficient deployment of long-context large language models (LLMs) faces a dilemma: amortized compression struggles to generalize out of distribution, while test-time training is costly and requires modifying model weights, introducing stateful parameters that complicate concurrent serving. The proposed Latent Context Compilation framework shifts from adaptation to compilation, using a disposable LoRA module as a compiler that distills long contexts into compact buffer tokens: stateless, portable memory artifacts compatible with the frozen base model. A key contribution is a self-aligned optimization strategy that eliminates the need for synthetic data and simplifies long-context processing.
📄 English Summary
Latent Context Compilation: Distilling Long Context into Compact Portable Memory
Efficient deployment of long-context large language models (LLMs) is hindered by a dichotomy: amortized compression struggles with out-of-distribution generalization, while test-time training incurs high costs for synthetic data and requires modifying model weights, complicating concurrent serving with stateful parameters. The proposed Latent Context Compilation framework shifts context processing from adaptation to compilation. A disposable LoRA module serves as a compiler that distills long contexts into compact buffer tokens: stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. A self-aligned optimization strategy eliminates the need for synthetic data and streamlines long-context processing.
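The compile-then-serve interface described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the names `compile_context` and `answer_with_buffers` are invented, the "compiler" is a toy mean-pooling stand-in for the LoRA module, and the "frozen base model" is a toy dot-product scorer. What it shows is the key contract: the compiler runs once per context and emits a fixed number of buffer-token embeddings, a stateless artifact that any number of concurrent queries can then reuse without touching model weights.

```python
# Toy sketch of the Latent Context Compilation interface (hypothetical API).
# The real system uses a disposable LoRA compiler and a frozen LLM; here both
# are replaced by deterministic stand-ins so the data flow is runnable.
import hashlib

DIM = 8          # toy embedding dimension
NUM_BUFFER = 4   # number of buffer tokens the compiler emits

def embed(token: str) -> list[float]:
    """Deterministic toy embedding: hash a token into DIM floats in [0, 1)."""
    h = hashlib.sha256(token.encode()).digest()
    return [b / 255 for b in h[:DIM]]

def compile_context(context: str, num_buffer: int = NUM_BUFFER) -> list[list[float]]:
    """Stand-in for the LoRA 'compiler': distill a long context into
    num_buffer buffer tokens by mean-pooling equal chunks of token embeddings.
    Runs once per context; its output is the stateless, portable artifact."""
    tokens = context.split()
    chunk = max(1, len(tokens) // num_buffer)
    buffers = []
    for i in range(num_buffer):
        group = tokens[i * chunk:(i + 1) * chunk] or tokens[-1:]
        vecs = [embed(t) for t in group]
        buffers.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return buffers

def answer_with_buffers(buffers: list[list[float]], query: str) -> int:
    """Stand-in for the frozen base model: condition on the buffer tokens
    (no weight updates) and return the index of the best-matching buffer."""
    q = embed(query)
    scores = [sum(a * b for a, b in zip(buf, q)) for buf in buffers]
    return scores.index(max(scores))

# One-time compilation, then many stateless queries against the same artifact.
context = "alpha beta gamma delta " * 50   # a 'long' context
buffers = compile_context(context)
best = answer_with_buffers(buffers, "gamma")
```

Because `buffers` carries no model state, the same artifact can be cached, shipped between machines, or served to many requests at once, which is the serving advantage the summary attributes to buffer tokens over weight-modifying test-time training.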
Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.