多模态提示优化:为何不利用多种模态来优化大规模语言模型

📄 中文摘要

多模态大规模语言模型(MLLMs)需要联合提示搜索,而不仅仅局限于文本提示。多模态提示优化(MPO)通过对文本和非文本输入进行联合优化,采用保持对齐的更新方式,确保解码器行为的稳定性,并使用贝叶斯选择器重用过去的评估作为先验。具体实施步骤包括:将非文本输入参数化为提示嵌入,冻结解码器并对提示向量应用保持对齐的更新,以及利用先前评估的贝叶斯获取方法来聚焦候选项。研究表明,联合多模态提示优化在性能上优于仅优化文本的方法,并且能够减少资源消耗。

📄 English Summary

Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

Multimodal Large Language Models (MLLMs) call for joint prompt search over all input modalities, not text alone. Multimodal Prompt Optimization (MPO) jointly optimizes text and non-text inputs, using alignment-preserving updates to keep decoder behavior stable, together with a Bayesian selector that reuses past evaluations as priors. In practice this means parameterizing non-text inputs as prompt embeddings, freezing the decoder while applying alignment-preserving updates to the prompt vectors, and employing a Bayesian acquisition method that leverages prior evaluations to focus the search on promising candidates. The findings indicate that joint multimodal prompt optimization outperforms text-only tuning while reducing resource consumption.
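The frozen-decoder step described above can be illustrated in a toy setting. This is a minimal sketch, not the paper's actual method: the decoder is stood in for by a fixed linear map `W`, and the "alignment-preserving update" is modeled as a gradient step projected back onto a norm ball so the prompt embedding stays in a region the frozen decoder handles well. All names, dimensions, and the projection scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a real MLLM: W is a frozen "decoder" (never
# updated), and `target` is the desired decoder output for this task.
dim = 8
W = rng.normal(size=(dim, dim))
target = rng.normal(size=dim)

def loss(prompt):
    """Squared error between the frozen decoder's output and the target."""
    return float(np.sum((W @ prompt - target) ** 2))

def grad(prompt):
    """Analytic gradient of the loss w.r.t. the prompt embedding only."""
    return 2.0 * W.T @ (W @ prompt - target)

def alignment_preserving_step(prompt, lr=0.005, max_norm=3.0):
    """Gradient step on the prompt embedding, then projection onto a norm
    ball -- a simple proxy for keeping the embedding 'aligned' with what
    the frozen decoder was trained on."""
    updated = prompt - lr * grad(prompt)
    norm = np.linalg.norm(updated)
    if norm > max_norm:
        updated *= max_norm / norm  # project back inside the ball
    return updated

# Non-text input parameterized as a small learnable prompt embedding.
prompt = rng.normal(size=dim) * 0.1
losses = [loss(prompt)]
for _ in range(200):
    prompt = alignment_preserving_step(prompt)
    losses.append(loss(prompt))
```

Only the prompt vector moves during optimization; `W` never changes, mirroring the frozen-decoder setup, and the projection bounds how far the embedding can drift per run.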
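The Bayesian selector that reuses past evaluations as priors can likewise be illustrated with a minimal UCB-style sketch. The candidate names, scores, and the shrinkage prior below are all hypothetical, and a real implementation would use a full posterior rather than this simple mean-shrinkage form:

```python
import math

# Hypothetical candidate prompts with scores carried over from earlier
# optimization rounds; these past evaluations act as the prior.
past_scores = {
    "prompt_a": [0.62, 0.58],
    "prompt_b": [0.71],
    "prompt_c": [],  # never evaluated: falls back entirely on the prior
}

PRIOR_MEAN, PRIOR_STRENGTH = 0.5, 1.0  # weak prior over candidate quality

def posterior_mean(scores):
    """Shrink the empirical mean toward the prior; more data, less shrinkage."""
    n = len(scores)
    return (PRIOR_STRENGTH * PRIOR_MEAN + sum(scores)) / (PRIOR_STRENGTH + n)

def ucb(scores, beta=0.3):
    """Upper confidence bound: posterior mean plus an exploration bonus
    that shrinks as a candidate accumulates evaluations."""
    n = len(scores)
    return posterior_mean(scores) + beta / math.sqrt(PRIOR_STRENGTH + n)

def select(candidates):
    """Pick the candidate to evaluate next, reusing past scores as priors."""
    return max(candidates, key=lambda c: ucb(candidates[c]))

best = select(past_scores)
```

Because old evaluations enter as priors rather than being discarded, candidates that already scored well need fewer fresh evaluations, which is one way joint optimization can reduce resource consumption.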

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

数据源: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace 等