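📄 Chinese Summary
The paper proposes a framework called "Thinking with Spatial Code" that converts RGB video into explicit, temporally consistent 3D representations for visual question answering about the physical world. The proposed spatial encoder parses video into structured spatial codes containing explicit 3D oriented bounding boxes and semantic labels, which lets large language models (LLMs) reason directly over explicit spatial variables. Specifically, the encoder encodes image and geometric features by unifying a 6D object parsing and tracking backbone with geometric prediction, and the LLMs are further fine-tuned with reinforcement learning, using a spatial rubric reward to encourage perspective-aware geometric reasoning.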
📄 English Summary
Thinking with Spatial Code for Physical-World Video Reasoning
The study introduces Thinking with Spatial Code, a framework that transforms RGB videos into explicit, temporally coherent 3D representations for physical-world visual question answering. Its spatial encoder parses videos into structured spatial codes made up of explicit 3D oriented bounding boxes and semantic labels, so that large language models (LLMs) can reason directly over explicit spatial variables. The encoder encodes image and geometric features by unifying a 6D object parsing and tracking backbone with geometric prediction. The LLMs are then fine-tuned with reinforcement learning, using a spatial rubric reward that encourages perspective-aware, geometrically grounded reasoning.
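To make the idea of a spatial code concrete, here is a minimal Python sketch of what one might look like once serialized for an LLM. The summary does not give the paper's actual schema, so everything here is an assumption for illustration: the `OrientedBox3D` field names, the JSON layout, the axis convention (x forward, y left, z up) and the `is_left_of` helper, which shows why a relation such as "left of" depends on the viewer's pose.

```python
from dataclasses import dataclass, asdict
import json
import math


@dataclass
class OrientedBox3D:
    """One object in a hypothetical spatial code (all field names are assumptions)."""
    track_id: int   # identifier kept stable across frames (temporal coherence)
    label: str      # semantic label
    center: tuple   # (x, y, z) in metres, world frame: x forward, y left, z up
    size: tuple     # (length, width, height) in metres
    yaw: float      # heading about the vertical axis, in radians


def to_spatial_code(boxes):
    """Serialize boxes to compact JSON text that could be placed in an LLM prompt."""
    return json.dumps([asdict(b) for b in boxes], separators=(",", ":"))


def is_left_of(a, b, viewer_xy, viewer_yaw):
    """Return True if `a` is to the left of `b` for a viewer at viewer_xy facing viewer_yaw."""
    def lateral(box):
        dx = box.center[0] - viewer_xy[0]
        dy = box.center[1] - viewer_xy[1]
        # Project onto the viewer's left axis, which is (-sin(yaw), cos(yaw)).
        return -math.sin(viewer_yaw) * dx + math.cos(viewer_yaw) * dy
    return lateral(a) > lateral(b)


if __name__ == "__main__":
    chair = OrientedBox3D(1, "chair", (2.0, 1.0, 0.0), (0.5, 0.5, 1.0), 0.0)
    table = OrientedBox3D(2, "table", (2.0, -1.0, 0.0), (1.2, 0.8, 0.7), 0.0)
    print(to_spatial_code([chair, table]))
    # Same scene, two viewpoints: the left/right relation flips.
    print(is_left_of(chair, table, (0.0, 0.0), 0.0))
    print(is_left_of(chair, table, (4.0, 0.0), math.pi))
```

With explicit boxes, a question like "is the chair left of the table from the doorway?" reduces to arithmetic over named variables rather than inference from pixels, which is the kind of reasoning the summary describes.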