Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

📄 Chinese Abstract

In noisy environments, combining a front-end speech enhancement (SE) model with self-supervised learning (SSL)-based speech models is highly effective for downstream tasks. Typically, the SE model is fine-tuned with a mean squared error (MSE) loss, computed over SSL representations, between the enhanced and clean speech. However, MSE tends to over-exploit the positional embeddings inside SSL models, so the objective can be minimized through positional correlations rather than genuine speech-feature correlations. This over-reliance on positional information can limit generalization when the model encounters speech with different temporal alignments or speaking rates. To address this, a new position-invariant fine-tuning method is proposed, which forces the model to learn more robust representations of the speech content and reduces its dependence on specific positional information.
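The conventional objective described above can be sketched as a frame-wise MSE over SSL-layer features. A minimal NumPy sketch, where the function name and array shapes are illustrative rather than taken from the paper:

```python
import numpy as np

def ssl_mse_loss(enhanced_feats, clean_feats):
    """Conventional fine-tuning objective: mean squared error between
    SSL-layer feature matrices of enhanced and clean speech.

    Both inputs are (frames, dims) arrays of SSL representations;
    the names here are illustrative, not the paper's notation.
    """
    return float(np.mean((enhanced_feats - clean_feats) ** 2))
```

Because the loss compares features frame-by-frame at identical positions, any positional component shared by both feature matrices lowers it "for free", which is the over-reliance the abstract describes.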

📄 English Summary

Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models proves effective for downstream tasks in noisy conditions. Conventionally, SE models are fine-tuned using SSL representations with a mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to over-exploiting positional embeddings within SSL models, allowing the objective to be minimized primarily through positional correlations rather than genuine speech feature relationships. This over-reliance on positional information can restrict the model's generalization capability when encountering speech with varying temporal alignments or speaking rates. To address this limitation, a novel position-invariant fine-tuning method is proposed. This approach aims to compel the model to learn more robust representations of speech content, thereby reducing its dependence on specific positional cues. A new loss function is introduced, which considers temporal offset robustness when computing the similarity between enhanced and clean speech. Specifically, techniques such as dynamic time warping (DTW) or cross-correlation might be employed to permit a certain degree of temporal misalignment between the enhanced and clean speech, preventing the model from merely reducing loss by aligning positional information. This encourages the model to focus on intrinsic speech features and semantic content rather than absolute positions within the sequence. Experimental results demonstrate that, compared to traditional MSE-based fine-tuning methods, position-invariant fine-tuning significantly improves the performance of speech enhancement models across various noisy conditions. It exhibits stronger robustness and generalization, particularly when processing speech with temporal shifts or variability.
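The summary names DTW and cross-correlation only as possible mechanisms; one minimal way to realize the described temporal-offset tolerance is a shift-tolerant MSE that takes the minimum loss over a small window of frame offsets. This is a hedged sketch under that assumption, not the paper's actual loss; `shift_tolerant_mse` and `max_shift` are hypothetical names:

```python
import numpy as np

def shift_tolerant_mse(enhanced, clean, max_shift=2):
    """Position-tolerant variant of the MSE objective (illustrative).

    Takes the minimum frame-wise MSE over temporal offsets in
    [-max_shift, +max_shift], so the loss cannot be driven down purely
    by aligning positional embeddings: a small misalignment between
    enhanced and clean features is forgiven.

    enhanced, clean: (frames, dims) SSL feature matrices.
    max_shift: tolerated misalignment in frames (hypothetical knob).
    """
    T = min(len(enhanced), len(clean))
    losses = []
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            e, c = enhanced[s:T], clean[:T - s]
        else:
            e, c = enhanced[:T + s], clean[-s:T]
        losses.append(np.mean((e - c) ** 2))
    return float(min(losses))
```

With `max_shift=0` this reduces to plain MSE; a positive window lets features that are identical up to a small temporal shift score near-zero loss, pushing the model toward content similarity rather than absolute sequence position.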

