Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

📄 Abstract

Multi-round conversational image generation requires a model to produce images from user instructions across multiple turns of interaction, grounding each generation in a dialogue history of interleaved text and images. Although recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training methods remain effectively Markovian: the model's next output depends mainly on the most recently generated image, which encourages shortcut solutions that ignore the full conversational history. When that history is ignored, the model fails to capture long-range dependencies and complex contextual shifts, and the semantic consistency and coherence of the generated images suffer substantially.

📄 English Summary

Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Multi-round conversational image generation requires models to follow user instructions across multiple interactions, grounding each output in the interleaved text and images that accumulate as chat history. While contemporary multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training paradigms are effectively Markovian: the next output depends mainly on the most recent image, which encourages shortcut solutions that ignore the full conversational history. When that history is disregarded, models fail to capture long-range dependencies and subtle contextual shifts, and the semantic consistency and coherence of the generated images degrade significantly.

Consider a user who requests several fine-grained modifications to the same image. If the model attends only to the last instruction and the immediately preceding image, ignoring all earlier edits, the result may drift from the user's original intent or even contradict it semantically. The Markovian assumption thus prevents models from understanding and exploiting the cumulative context of the conversation, limiting their performance in complex, long-horizon dialogue.

Overcoming this limitation calls for model architectures and training strategies that integrate and exploit the complete conversational history for more accurate and coherent multi-round image generation. This includes mechanisms for encoding and remembering multi-modal history, as well as metrics for evaluating long-term consistency in non-Markovian dialogues.
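The contrast between the two conditioning schemes can be made concrete with a minimal Python sketch. All names here (`Turn`, `markov_context`, `full_history_context`, the `<img_k>` placeholder tokens) are hypothetical illustrations, not APIs from the paper: the point is only that Markovian conditioning silently drops every instruction except the latest one, while history conditioning keeps the interleaved text-image record intact.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One round of the conversation: a user instruction plus the image generated for it."""
    instruction: str
    image_ref: str  # placeholder token standing in for the generated image

def markov_context(history: list[Turn], new_instruction: str) -> list[str]:
    """Markovian conditioning: only the most recent image and the new instruction.

    Every earlier instruction and image is discarded, which is the shortcut
    the text above criticizes.
    """
    ctx: list[str] = []
    if history:
        ctx.append(history[-1].image_ref)
    ctx.append(new_instruction)
    return ctx

def full_history_context(history: list[Turn], new_instruction: str) -> list[str]:
    """Non-Markovian conditioning: every instruction and image, interleaved in order."""
    ctx: list[str] = []
    for turn in history:
        ctx.append(turn.instruction)
        ctx.append(turn.image_ref)
    ctx.append(new_instruction)
    return ctx

# Example: three rounds of edits to the same picture.
history = [Turn("draw a cat", "<img_1>"), Turn("make it blue", "<img_2>")]
print(markov_context(history, "add a hat"))
# The Markovian context no longer contains "draw a cat" or "make it blue",
# so a model seeing only this context cannot verify earlier constraints.
print(full_history_context(history, "add a hat"))
```

Real history-conditioned MLLMs would pass actual image embeddings rather than string placeholders, but the structural difference (a two-element context versus the full interleaved record) is the same.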
