ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

📄 Summary

The paper presents ETA-VLA, an Efficient Token Adaptation framework designed to optimize Vision-Language-Action (VLA) models for autonomous driving. The framework fuses past multi-view image frames and introduces a novel Intra-LLM Sparse Aggregator (ILSA), inspired by how human drivers allocate attention. ILSA dynamically identifies and prunes redundant tokens, sharply reducing the computational burden that stems from the quadratic complexity of self-attention. ETA-VLA thereby aims to improve the accuracy of temporal reasoning while preserving computational efficiency, offering a more effective solution for interpreting complex scenes and executing control commands.
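
The summary does not spell out how ILSA scores or drops tokens. As a rough illustration only (not the paper's actual method), the sketch below prunes fused multi-view, multi-frame visual tokens with a learned importance score and top-k selection, which is one common way to shrink the token count before self-attention; the class and parameter names (`SparseTokenAggregator`, `keep_ratio`) are hypothetical.

```python
# Hypothetical sketch of intra-LLM token sparsification (not the paper's exact ILSA).
# Idea: score each visual token, keep only the top fraction, so downstream
# self-attention cost drops from O(N^2) to roughly O(k^2) with k << N.
import torch
import torch.nn as nn


class SparseTokenAggregator(nn.Module):
    """Scores tokens with a small MLP and keeps the top `keep_ratio` fraction."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. fused multi-view, multi-frame visual tokens
        scores = self.scorer(tokens).squeeze(-1)           # (batch, num_tokens)
        k = max(1, int(tokens.size(1) * self.keep_ratio))  # number of tokens to keep
        top_idx = scores.topk(k, dim=1).indices            # most informative tokens
        top_idx, _ = top_idx.sort(dim=1)                   # preserve original token order
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, gather_idx)                # (batch, k, dim)


if __name__ == "__main__":
    # Toy example: 2 past frames x 3 camera views x 196 patch tokens = 1176 tokens per sample.
    x = torch.randn(4, 2 * 3 * 196, 768)
    pruned = SparseTokenAggregator(dim=768, keep_ratio=0.25)(x)
    print(pruned.shape)  # torch.Size([4, 294, 768]): ~4x fewer tokens fed to the LLM
```

Cutting the token count by 4x in this toy setup reduces the self-attention cost by roughly 16x, which is the kind of saving the summary attributes to pruning redundant information.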

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others