Learning to Select Visual In-Context Demonstrations


📄 English Summary

Learning to Select Visual In-Context Demonstrations

Multimodal Large Language Models (MLLMs) adapt to visual tasks through in-context learning (ICL), which is heavily reliant on the quality of demonstrations. The prevalent demonstration selection strategy employs unsupervised k-Nearest Neighbor (kNN) search. While this approach is straightforward, it proves sub-optimal for complex factual regression tasks, as it tends to select redundant examples that do not adequately represent the full output range of the task. This research reframes the selection process as a sequential decision-making problem and introduces Learning to Select Demonstrations (LSD), which trains a Reinforcement Learning agent to construct optimal demonstration sets. Utilizing a Dueling DQN with a query-centric Transformer Decoder, the agent learns a policy that maximizes the downstream performance of MLLMs.
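The two selection strategies contrasted above can be sketched in pure Python. Everything here is an illustrative assumption rather than the paper's implementation: the embeddings are toy 2-D vectors, and `toy_value` is a hand-written stand-in for the learned value function (in LSD proper, a Dueling DQN over a query-centric Transformer decoder trained against downstream MLLM performance).

```python
# Hypothetical sketch of the two demonstration-selection strategies:
# similarity-first kNN vs. sequential selection under a value function.
# Embeddings, pool layout, and toy_value are illustrative assumptions.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_select(query_emb, pool, k):
    """Similarity-first baseline: the k nearest neighbours of the query.
    Tends to return near-duplicates when the pool clusters around the query."""
    ranked = sorted(pool, key=lambda d: cosine(query_emb, d["emb"]), reverse=True)
    return ranked[:k]

def toy_value(query_emb, chosen, cand):
    """Stand-in for the trained agent's value function: relevance to the
    query, discounted by redundancy with demonstrations already chosen."""
    rel = cosine(query_emb, cand["emb"])
    red = max((cosine(cand["emb"], d["emb"]) for d in chosen), default=0.0)
    return rel * (1.0 - red)

def sequential_select(query_emb, pool, k, value_fn):
    """Sequential decision-making view: greedily pick, at each step, the
    candidate the value function scores highest given the set so far."""
    chosen, remaining = [], list(pool)
    for _ in range(k):
        best = max(remaining, key=lambda d: value_fn(query_emb, chosen, d))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

On a toy pool where two candidates sit almost on top of the query and a third is moderately relevant but distinct, `knn_select` returns the two near-duplicates, while `sequential_select` with the redundancy-aware value trades the second duplicate for the distinct example — the failure mode and the remedy the summary describes.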

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others