ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

📄 Summary

The paper presents ETA-VLA, an Efficient Token Adaptation framework designed to optimize Vision-Language-Action (VLA) models for autonomous driving. The framework fuses past multi-view image frames and introduces a novel Intra-LLM Sparse Aggregator (ILSA), inspired by how human drivers allocate attention. ILSA dynamically identifies and prunes redundant tokens, sharply reducing the computational burden that stems from the quadratic complexity of self-attention. ETA-VLA thereby aims to improve the accuracy of temporal reasoning while preserving computational efficiency, offering a more effective solution for interpreting complex scenes and executing control commands.
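
The summary does not spell out how ILSA scores or drops tokens. As a rough illustration only (not the paper's actual method), the sketch below prunes fused multi-view, multi-frame visual tokens with a learned importance score and top-k selection, which is one common way to shrink the token count before self-attention; the class and parameter names (`SparseTokenAggregator`, `keep_ratio`) are hypothetical.

```python
# Hypothetical sketch of intra-LLM token sparsification (not the paper's exact ILSA).
# Idea: score each visual token, keep only the top fraction, so downstream
# self-attention cost drops from O(N^2) to roughly O(k^2) with k << N.
import torch
import torch.nn as nn


class SparseTokenAggregator(nn.Module):
    """Scores tokens with a small MLP and keeps the top `keep_ratio` fraction."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. fused multi-view, multi-frame visual tokens
        scores = self.scorer(tokens).squeeze(-1)           # (batch, num_tokens)
        k = max(1, int(tokens.size(1) * self.keep_ratio))  # number of tokens to keep
        top_idx = scores.topk(k, dim=1).indices            # most informative tokens
        top_idx, _ = top_idx.sort(dim=1)                   # preserve original token order
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, gather_idx)                # (batch, k, dim)


if __name__ == "__main__":
    # Toy example: 2 past frames x 3 camera views x 196 patch tokens = 1176 tokens per sample.
    x = torch.randn(4, 2 * 3 * 196, 768)
    pruned = SparseTokenAggregator(dim=768, keep_ratio=0.25)(x)
    print(pruned.shape)  # torch.Size([4, 294, 768]): ~4x fewer tokens fed to the LLM
```

Cutting the token count by 4x in this toy setup reduces the self-attention cost by roughly 16x, which is the kind of saving the summary attributes to pruning redundant information.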

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others