CLUTCH: A Contextualized Language Model for Unlocking Text-Conditioned Hand Motion Modelling

📄 English Summary

CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Hands play a crucial role in daily life, yet the modeling of natural hand motion remains underexplored. Existing methods for text-to-hand-motion generation and hand-animation captioning rely on limited, studio-captured datasets that are costly to scale to 'in-the-wild' settings. Furthermore, contemporary models and their training schemes struggle to deliver animation fidelity together with text-motion alignment. To address these challenges, the authors introduce the '3D Hands in the Wild' (3D-HIW) dataset, which contains 32K 3D hand-motion sequences with aligned text. They also propose CLUTCH, an LLM-based hand animation system with two key innovations: (a) SHIFT, a novel VQ-VAE architecture for tokenizing hand motion, and (b) a geometric refinement stage to enhance animation quality.
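To make the idea of "tokenizing" motion concrete, the following is a minimal sketch of the generic vector-quantization lookup found in any VQ-VAE. It is not the SHIFT architecture, whose details the summary does not give. The 2-D latent vectors, the three-entry codebook, and the function names `tokenize` and `detokenize` are all hypothetical, and the learned encoder and decoder that would surround this step are omitted.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent frame to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def detokenize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each token index with its codebook vector."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 3-entry codebook and 4 latent frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 0.05]]

tokens = tokenize(frames, codebook)
reconstruction = detokenize(tokens, codebook)
print(tokens)
```

In a system of this kind, the resulting integer sequence is what a language model would read or generate alongside text tokens; the reconstruction shows the information lost by snapping each frame to its nearest code.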
