VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models
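📄 Summary (translated from Chinese)
Video action models provide a strong foundation for vision-language-action systems because they can learn visual dynamics from large-scale video data and transfer that knowledge to downstream robot control. However, the likelihood-surrogate objectives used by current diffusion-based video predictors encourage globally plausible predictions without explicitly optimizing the precise visual dynamics that manipulation requires. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing, which can be amplified in downstream policies. VAMPO is a post-training framework that directly improves visual dynamics in video action models through policy optimization. Its key idea is...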
📄 English Summary
VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models
Video action models are an appealing foundation for Vision-Language-Action systems because they learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. However, current diffusion-based video predictors are trained with likelihood-surrogate objectives that promote globally plausible predictions without explicitly optimizing the precision-critical visual dynamics that manipulation requires. This objective mismatch often produces subtle errors in object pose, spatial relations, and contact timing, which downstream policies can amplify. VAMPO is a post-training framework that directly improves visual dynamics in video action models through policy optimization. The core idea is to...