Physical-World Video Reasoning with Spatial Code

Source: Thinking with Spatial Code for Physical-World Video Reasoning

Published: March 9, 2026

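📄 Summary

This work proposes a framework called "Thinking with Spatial Code" that converts RGB video into explicit, temporally consistent 3D representations for physical-world visual question answering. The proposed spatial encoder parses video into structured spatial code containing explicit 3D oriented bounding boxes and semantic labels, so that large language models (LLMs) can reason directly over explicit spatial variables. Specifically, the spatial encoder encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction. The LLMs are then fine-tuned with reinforcement learning, using a spatial rubric reward that encourages perspective-aware geometric reasoning.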

🏷️ Related Tags

#SpatialCode #VideoReasoning #VisualQuestionAnswering #3DRepresentation #LargeLanguageModels

📄 English Summary

Thinking with Spatial Code for Physical-World Video Reasoning

The study introduces a framework called Thinking with Spatial Code, which transforms RGB videos into explicit, temporally coherent 3D representations for physical-world visual question answering. The proposed spatial encoder parses videos into structured spatial code consisting of explicit 3D oriented bounding boxes and semantic labels, which lets large language models (LLMs) reason directly over explicit spatial variables rather than raw pixels. Specifically, the spatial encoder encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction. The LLMs are then fine-tuned using reinforcement learning with a spatial rubric reward that promotes perspective-aware, geometrically grounded reasoning.
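
To make the idea of "spatial code" concrete, here is a minimal Python sketch of what such a structured representation might look like and how a perspective-dependent question could be answered from it. Everything in it is an assumption for illustration: the `OrientedBox3D` fields, the JSON layout, the z-up world frame, and the yaw-only orientation (a simplification of a full 6D pose) are hypothetical and are not taken from the paper.

```python
import json
import math
from dataclasses import dataclass, asdict


@dataclass
class OrientedBox3D:
    """Hypothetical spatial-code entry: one tracked object in one frame."""
    track_id: int   # ID kept stable across frames (temporal consistency)
    label: str      # semantic label, e.g. "chair"
    center: tuple   # (x, y, z) in metres, world frame, z pointing up
    size: tuple     # (length, width, height) in metres
    yaw: float      # heading about the z axis, in radians


def to_spatial_code(frame_index, boxes):
    """Serialize one frame's boxes to JSON text that an LLM could read."""
    return json.dumps({"frame": frame_index,
                       "objects": [asdict(b) for b in boxes]})


def relative_direction(observer, target):
    """Say where `target` lies from the viewpoint of `observer`."""
    dx = target.center[0] - observer.center[0]
    dy = target.center[1] - observer.center[1]
    # Rotate the world-frame offset into the observer's frame
    # (x axis forward, y axis to the left).
    cos_y, sin_y = math.cos(observer.yaw), math.sin(observer.yaw)
    forward = dx * cos_y + dy * sin_y
    left = -dx * sin_y + dy * cos_y
    if abs(forward) >= abs(left):
        return "front" if forward >= 0 else "behind"
    return "left" if left >= 0 else "right"


if __name__ == "__main__":
    chair = OrientedBox3D(1, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0), 0.0)
    lamp = OrientedBox3D(2, "lamp", (0.0, 2.0, 0.0), (0.3, 0.3, 1.5), 0.0)
    print(to_spatial_code(0, [chair, lamp]))
    print(relative_direction(chair, lamp))
```

The point of the sketch is that once objects are explicit variables with positions and headings, a question such as "is the lamp to the left of the chair, from the chair's point of view?" reduces to a small coordinate transform instead of a judgment made directly on pixels.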
