📄 English Summary
Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
Recent foundational video-to-video diffusion models have achieved impressive results in editing user-provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process in which users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain consistency across sequential edits: they often lose the stylistic, content, or structural information established in previous editing rounds, so subsequent edits produce results that are inconsistent with earlier versions, degrading both user experience and editing efficiency. For instance, after several style transfers or object replacements on a video, a model may fail to preserve object identity across rounds, or the overall video style may drift over successive edits. The key to addressing this challenge lies in introducing a memory mechanism that allows the model to store and retrieve critical information across editing rounds. This memory mechanism could involve encoding user intent, editing history, and intermediate representations of the video content at different stages of editing. By integrating a memory module, the model can consult previous editing states when performing a new edit, thereby ensuring high consistency and coherence throughout the entire video editing process. This not only improves the quality of the editing results but also significantly enhances user satisfaction in iterative editing workflows.
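The cross-round memory described above can be sketched as a simple key-value store: each completed edit writes a pair (an embedding of the edit instruction as the key, a compact summary of the edited video's latent as the value), and a new edit reads back the most relevant past rounds to condition on. The sketch below is a minimal, illustrative version of such a module; the class name `EditMemory` and the assumption that instructions and latents are available as flat vectors are hypothetical, not part of any stated Memory-V2V design.

```python
import numpy as np


class EditMemory:
    """Hypothetical cross-round memory for multi-turn video editing.

    Each write stores the embedding of one edit instruction (key) and a
    pooled summary of the resulting video latent (value). A read returns
    the stored latents most similar to a new instruction, which could
    then be fed to the editor's conditioning pathway.
    """

    def __init__(self):
        self.keys = []    # instruction embeddings of past rounds
        self.values = []  # latent summaries of the corresponding results

    def write(self, instruction_emb, result_latent):
        """Record one completed editing round."""
        self.keys.append(np.asarray(instruction_emb, dtype=np.float64))
        self.values.append(np.asarray(result_latent, dtype=np.float64))

    def read(self, query_emb, top_k=2):
        """Return up to top_k stored latents ranked by cosine similarity
        between the new instruction and past instructions."""
        if not self.keys:
            return []
        q = np.asarray(query_emb, dtype=np.float64)
        sims = [
            float(q @ k / (np.linalg.norm(q) * np.linalg.norm(k) + 1e-8))
            for k in self.keys
        ]
        order = np.argsort(sims)[::-1][:top_k]
        return [self.values[i] for i in order]
```

In a full system the retrieved latents would be injected into the diffusion editor, for example by concatenation with the current conditioning or via cross-attention, so that round *t* can reference what rounds 1..*t-1* established; the store itself stays deliberately simple here.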