📄 English Summary
Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmark Encoding
Dynamic Neural Radiance Fields (NeRF) have achieved considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advances in rendering speed and generation quality, accurately and efficiently capturing mouth movements in talking portraits remains a challenge. To address it, this work proposes an automatic method based on blink embedding and hash-grid landmark encoding. The blink embedding gives precise control over eye movements and synchronizes them more naturally with the audio input, enhancing the realism and expressiveness of the portrait. Concurrently, hash grids encode facial landmarks efficiently, significantly improving the capture and reconstruction of mouth movements. The hash grids not only make the landmark representation more compact but also accelerate training and inference, so complex facial-dynamics generation can be completed in less time. Specifically, the system first extracts speech features from the input audio and, combined with the blink embedding, predicts key facial landmark positions. These landmark positions are then encoded by hash grids into compact, informative feature representations, which are fed into a NeRF-based generative model that renders 3D portraits with high fidelity and natural mouth movements. This method addresses the limitations of conventional NeRF models in handling fine-grained mouth motion, achieving more realistic and expressive audio-driven face generation. Experimental results show that the method outperforms existing techniques in visual quality and lip synchronization, offering a new advance in the field of face generation.
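To make the hash-grid landmark encoding step concrete, the sketch below shows a minimal multiresolution hash-grid encoder for 2D landmark coordinates, in the spirit of Instant-NGP-style grids. All names, table sizes, primes, and the hashing scheme are illustrative assumptions, not the paper's actual implementation; a real system would use learnable GPU tables rather than NumPy arrays.

```python
import numpy as np

# Spatial-hash primes (illustrative choice, commonly used in hash-grid work)
PRIMES = np.array([1, 2654435761], dtype=np.uint64)

def hash_coords(coords, table_size):
    """Hash integer 2D grid coordinates into a feature-table index."""
    c = coords.astype(np.uint64)
    return ((c[..., 0] * PRIMES[0]) ^ (c[..., 1] * PRIMES[1])) % np.uint64(table_size)

class HashGridEncoder2D:
    """Toy multiresolution hash-grid encoder for normalized 2D landmarks."""

    def __init__(self, n_levels=4, base_res=16, growth=2.0,
                 table_size=2**14, feat_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        # geometric progression of grid resolutions, coarse to fine
        self.resolutions = [int(base_res * growth**l) for l in range(n_levels)]
        # one small feature table per level (learnable in a real model)
        self.tables = [rng.normal(scale=1e-2, size=(table_size, feat_dim))
                       for _ in range(n_levels)]
        self.table_size = table_size
        self.out_dim = n_levels * feat_dim

    def encode(self, xy):
        """xy: (N, 2) coords in [0, 1]^2 -> (N, out_dim) concatenated features."""
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            pos = xy * res
            lo = np.floor(pos).astype(np.int64)
            frac = pos - lo
            acc = np.zeros((xy.shape[0], table.shape[1]))
            # bilinear interpolation over the 4 surrounding grid corners
            for dx in (0, 1):
                for dy in (0, 1):
                    corner = lo + np.array([dx, dy])
                    w = (frac[:, 0] if dx else 1 - frac[:, 0]) * \
                        (frac[:, 1] if dy else 1 - frac[:, 1])
                    idx = hash_coords(corner, self.table_size)
                    acc += w[:, None] * table[idx]
            feats.append(acc)
        return np.concatenate(feats, axis=1)

# Encode 68 hypothetical landmarks into a compact feature vector
enc = HashGridEncoder2D()
landmarks = np.random.default_rng(1).random((68, 2))  # normalized (x, y)
features = enc.encode(landmarks)
print(features.shape)  # (68, 8): 4 levels x 2 features per level
```

In a full pipeline, these per-landmark features (together with the blink embedding and audio features) would condition the NeRF; the hash tables keep the encoding compact while the multiresolution levels capture both coarse face shape and fine mouth detail.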