Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
📄 Summary
Proactive, real-time interaction is crucial for human-like AI companions, yet it poses three main challenges: achieving low-latency inference under continuous streaming inputs, autonomously deciding when to respond, and controlling both the quality and quantity of generated content to meet real-time constraints. This research instantiates AI companions through two gaming roles, commentator and guide, chosen because they lend themselves to automatic evaluation. It introduces a Live Gaming Benchmark, a large-scale dataset covering three representative scenarios: solo commentary, co-commentary, and user guidance. It also presents Proact-VL, a general framework that adapts multimodal language models for proactive, real-time interaction.
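The respond/stay-silent decision described above can be pictured as a per-frame gating loop over the video stream. The sketch below is purely illustrative and not from the paper: `salience` stands in for whatever score the model's decision head would produce, and the `threshold`, `cooldown`, and `max_tokens` parameters are hypothetical knobs for controlling when and how much the companion speaks.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    timestamp: float  # seconds into the stream
    salience: float   # hypothetical "worth commenting on" score in [0, 1]

def proactive_loop(frames: List[Frame],
                   threshold: float = 0.7,
                   cooldown: float = 2.0,
                   max_tokens: int = 24) -> List[str]:
    """Toy stand-in for a proactive VideoLLM's control loop: for each
    incoming frame, autonomously decide whether to emit a (length-capped)
    utterance or stay silent."""
    outputs: List[str] = []
    last_spoken = float("-inf")
    for frame in frames:
        # Speak only when the stream is salient enough AND we are past the
        # cooldown window -- a crude form of real-time quantity control.
        if frame.salience >= threshold and frame.timestamp - last_spoken >= cooldown:
            outputs.append(
                f"[t={frame.timestamp:.1f}s] commentary (<= {max_tokens} tokens)"
            )
            last_spoken = frame.timestamp
    return outputs

stream = [Frame(0.0, 0.1), Frame(1.0, 0.9), Frame(1.5, 0.95), Frame(4.0, 0.8)]
print(proactive_loop(stream))
```

In this toy run, the frame at t=1.5s is salient but suppressed by the cooldown, so only the t=1.0s and t=4.0s frames trigger utterances; a real system would replace the threshold with a learned decision and the placeholder string with constrained generation.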