
📄 English Summary

Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

Adapting large pre-trained language models to downstream tasks frequently necessitates fine-tuning millions of parameters or deploying computationally expensive dense weight updates, severely impeding their use in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces the number of trainable parameters by factorizing weight updates, yet its underlying dense weights still impose substantial storage and computational overheads. Magnitude-based pruning techniques can yield sparse models but typically suffer from performance degradation or require intricate re-training strategies. Addressing these limitations, a novel Sparsity-Aware Low-Rank Representation (SALOR) method is proposed to further enhance fine-tuning efficiency by introducing structured sparsity into the low-rank update matrices. The core principle of SALOR is to leverage pruning techniques to identify and remove low-contributing elements within the low-rank matrices during decomposition, achieving more aggressive parameter compression and computational optimization. Specifically, SALOR applies sparsity constraints to LoRA's incremental update matrices (ΔW = AB), ensuring that a large fraction of the elements in matrices A and B are zero. This approach not only retains LoRA's advantage of reducing trainable parameters but also further decreases storage requirements and inference-time computational complexity by sparsifying the low-rank decomposition. Compared to standard LoRA, SALOR can significantly reduce the number of fine-tuning parameters and inference latency while maintaining or even improving model performance. Experimental evaluations demonstrate that SALOR achieves superior parameter and computational efficiency across various downstream tasks compared to LoRA and other pruning methods, positioning it as a strong candidate for deploying large language models on edge devices or in settings with limited computational resources. By integrating low-rank adaptation with structured sparsity, this technique offers a novel and effective approach to efficient fine-tuning of large language models.
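The parameter-reduction idea described above can be sketched numerically: factorize the update as ΔW = AB, apply magnitude pruning to the factors A and B so that most of their entries become zero, then count the surviving non-zero parameters. This is a minimal illustration, not the paper's implementation: the layer dimensions, the rank, the 90% sparsity target, and the use of a simple magnitude threshold (rather than SALOR's actual sparsity criterion, which the summary does not specify) are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a d-by-k weight matrix adapted with LoRA rank r.
d, k, r = 1024, 1024, 8
A = rng.standard_normal((d, r))  # low-rank factors: delta_W = A @ B
B = rng.standard_normal((r, k))

def magnitude_prune(M, sparsity):
    """Zero out the fraction `sparsity` of entries with smallest magnitude."""
    thresh = np.quantile(np.abs(M), sparsity)
    return np.where(np.abs(M) >= thresh, M, 0.0)

sparsity = 0.9  # hypothetical target: ~90% of factor entries set to zero
A_s = magnitude_prune(A, sparsity)
B_s = magnitude_prune(B, sparsity)

# Compare parameter counts: full dense update vs. LoRA vs. sparsified LoRA.
dense_params = d * k
lora_params = A.size + B.size
salor_params = int((A_s != 0).sum() + (B_s != 0).sum())

print(f"dense update: {dense_params} params")
print(f"LoRA factors: {lora_params} params")
print(f"sparsified factors: {salor_params} non-zero params")
```

With these toy numbers, the rank-8 factorization already shrinks the update by two orders of magnitude, and sparsifying the factors removes roughly another 90% of the remaining parameters, which is the compounding effect the abstract claims.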
