📄 English Summary
Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation
Automatic speech recognition (ASR) systems often perform poorly on accented speech due to acoustic-phonetic and prosodic shifts that create mismatches with training data, making labeled accent adaptation costly. Common pseudo-label selection heuristics are primarily text-centric, such as perplexity (PPL) filtering, which may favor fluent yet acoustically mismatched hypotheses, leading to error amplification during fine-tuning. To address this issue, a multimodal consistency-guided, reference-free data selection pipeline is proposed for ASR accent adaptation under a transductive, label-free protocol. The pipeline begins with a target-aware preselection step based on submodular mutual information to enhance query relevance and reduce downstream computation. This approach significantly improves the efficiency and accuracy of accent adaptation.
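The target-aware preselection step can be illustrated with a facility-location-style submodular mutual information objective, maximized greedily. This is a minimal sketch under stated assumptions: it uses cosine similarity between candidate and target-accent utterance embeddings, and `greedy_preselect` and all other names are illustrative, not the paper's actual implementation (whose exact SMI instantiation and embedding model are not specified here).

```python
import numpy as np

def greedy_preselect(cand_emb, query_emb, k):
    """Greedy maximization of a facility-location-style submodular
    mutual information between the selected candidate set and a
    target-accent query set; greedy selection gives a (1 - 1/e)
    approximation for monotone submodular objectives."""
    # Cosine similarities between target queries and unlabeled candidates.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = q @ c.T                        # shape: (n_query, n_cand)

    selected, gains = [], []
    covered = np.zeros(sim.shape[0])     # best coverage of each query so far
    for _ in range(k):
        # Marginal gain of adding each candidate to the selected set.
        # (Clipping at current coverage keeps the objective monotone.)
        gain = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gain[selected] = -np.inf     # never re-pick a chosen candidate
        best = int(np.argmax(gain))
        selected.append(best)
        gains.append(float(gain[best]))
        covered = np.maximum(covered, sim[:, best])
    return selected, gains
```

Given utterance embeddings from any speech encoder, `greedy_preselect(cand, query, k)` returns the `k` candidates that best cover the target-accent queries; the non-increasing `gains` sequence can double as a stopping criterion, which is how such a preselection step can cut downstream scoring cost.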