ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
📄 Summary (translated from Chinese)
This work proposes Visual Foresight Planning (ForeAct), a general and efficient planner that steers Vision-Language-Action (VLA) models through concrete, executable actions step by step, especially in open-world environments. ForeAct leverages imagined future observations and subtask descriptions so the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, improving both accuracy and generalization. The planner includes an efficient foresight image generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction in only 0.33 seconds, greatly improving execution efficiency.
📄 English Summary
ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
This research presents Visual Foresight Planning (ForeAct), a general and efficient planner designed to guide Vision-Language-Action (VLA) models in executing concrete actions step-by-step, particularly in open-world environments. By leveraging imagined future observations and subtask descriptions, ForeAct enables the VLA to focus on visuo-motor inference rather than high-level semantic reasoning, resulting in improved accuracy and generalization. The planner includes a highly efficient foresight image generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction in just 0.33 seconds on an H100 GPU, significantly enhancing execution efficiency.
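The closed-loop pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the authors' actual API: the function and type names (`Foresight`, `foreact_loop`, `plan`, `act`) are assumptions, and the observation/image types are stand-in strings. The point is the division of labor: the planner turns the high-level instruction into a visual goal plus subtask text, and the VLA policy only has to map (current observation, visual goal) to low-level actions.

```python
# Hypothetical sketch of a ForeAct-style control loop.
# All names here are illustrative stand-ins, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Foresight:
    image: str    # stand-in for the imagined 640x480 future observation
    subtask: str  # short language description of the current subtask


def foreact_loop(
    instruction: str,
    observe: Callable[[], str],                  # current visual observation
    plan: Callable[[str, str], Foresight],       # (obs, instruction) -> visual goal
    act: Callable[[str, Foresight], List[str]],  # (obs, goal) -> low-level actions
    done: Callable[[str], bool],
    max_steps: int = 10,
) -> List[str]:
    """Replan a visual goal each step, then let the VLA act toward it."""
    actions: List[str] = []
    for _ in range(max_steps):
        obs = observe()
        if done(obs):
            break
        goal = plan(obs, instruction)   # planner: imagine future obs + subtask
        actions.extend(act(obs, goal))  # VLA: visuo-motor inference only
    return actions


# Toy usage with stub planner/VLA: the "environment" is just a counter.
state = {"progress": 0}

def observe() -> str:
    return f"progress={state['progress']}"

def plan(obs: str, instruction: str) -> Foresight:
    return Foresight(image=f"img@{obs}", subtask=f"advance from {obs}")

def act(obs: str, goal: Foresight) -> List[str]:
    state["progress"] += 1
    return [f"action_toward({goal.subtask})"]

def done(obs: str) -> bool:
    return state["progress"] >= 3

acts = foreact_loop("tidy the table", observe, plan, act, done)
```

In this sketch the planner is called before every action chunk, which is where the paper's 0.33 s foresight-generation time matters: a slow image generator in that inner loop would dominate execution time.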