From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs

📄 English Summary

From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs

Recent advancements in Large Language Models (LLMs) have significantly enhanced the capability of Artificial Intelligence (AI) to support complex real-world tasks. This has pushed research beyond textual boundaries into multimodal contexts and led to the emergence of Multimodal Large Language Models (MLMs). As LLM-based assistants are increasingly adopted for solving technical or domain-specific problems, a natural next step is to expand their input domains by leveraging MLMs. Ideally, these MLMs should function as real-time assistants in procedural tasks, integrating a view of the user's environment or even sharing the user's perspective through Virtual Reality (VR) or Augmented Reality (AR).