LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics

📄 Summary

Vision-Language-Action (VLA) models offer a unified framework for perception, language conditioning, and action generation. However, many existing systems remain difficult to deploy in embedded robotic environments due to their computational demands and inference latency. LiteVLA-Edge is a deployment-oriented VLA pipeline designed for fully on-device inference on Jetson Orin-class hardware. The approach integrates supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference via the llama.cpp runtime. Under the specified deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5 ms (approximately 6.6 Hz) while operating entirely offline.
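The reported control rate follows directly from the mean latency: 1000 ms / 150.5 ms ≈ 6.6 Hz. A minimal sketch of how such an end-to-end latency figure could be measured is shown below; `run_vla_inference` is a hypothetical placeholder for the quantized model call (in deployment this would invoke the llama.cpp runtime on the 4-bit GGUF model), not part of the released pipeline.

```python
import time
import statistics

def run_vla_inference(image, instruction):
    """Hypothetical stand-in for the quantized VLA forward pass.

    On the real device this would call the llama.cpp runtime on the
    4-bit GGUF model; here we simulate ~150 ms of on-device latency.
    """
    time.sleep(0.15)
    return [0.0] * 7  # e.g. a 7-DoF action vector (illustrative shape)

# Measure mean end-to-end latency over repeated trials.
latencies_ms = []
for _ in range(10):
    start = time.perf_counter()
    run_vla_inference(image=None, instruction="pick up the red block")
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

mean_ms = statistics.mean(latencies_ms)
print(f"mean end-to-end latency: {mean_ms:.1f} ms "
      f"({1000.0 / mean_ms:.1f} Hz control rate)")
```

Wall-clock timing around the full call (image in, action out) is what makes this an end-to-end figure rather than a model-only one.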
