📄 Chinese Abstract (translated)
This study systematically examines the effectiveness of complex attention mechanisms for multimodal emotion recognition on small datasets. Three classes of models were built and evaluated on the EAV dataset: a baseline Transformer model (M1), a novel factorized attention model (M2), and an improved CNN baseline (M3). The results show that complex attention models consistently underperform on small datasets; specifically, M2 scored 5% to 13% below M1. In contrast, combining traditional domain features (e.g., Mel-frequency cepstral coefficients (MFCCs) for audio and facial landmarks for vision) with the improved CNN model (M3) showed clear advantages: M3 surpassed both the Transformer baseline and the factorized attention model on multiple evaluation metrics.
📄 English Summary
Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
This study systematically investigates the efficacy of complex attention mechanisms in multimodal emotion recognition on small datasets. Three categories of models were implemented and evaluated on the EAV dataset: baseline Transformer models (M1), novel factorized attention mechanism models (M2), and improved CNN baseline models (M3). Experimental results consistently demonstrate that sophisticated attention mechanisms underperform on small datasets. Specifically, M2 models exhibited a performance deficit of 5% to 13% compared to M1 models. In contrast, leveraging traditional domain-specific features (e.g., Mel-frequency cepstral coefficients (MFCCs) for audio, facial landmarks for visual data) combined with improved CNN models (M3) showed significant advantages: M3 models surpassed both the Transformer baselines and the factorized attention models across various evaluation metrics. This suggests that for emotion recognition tasks with limited data, the combination of carefully engineered domain features and convolutional neural networks can capture emotional patterns more effectively than complex attention mechanisms, which typically require extensive training data. The research highlights the importance of a deep understanding of inherent data characteristics and of feature engineering in resource-constrained or data-scarce scenarios, which can outweigh the pursuit of model complexity. Furthermore, the findings imply that Transformer models, without large-scale pre-training data, may exhibit limitations in generalization ability and feature-extraction efficiency, particularly when handling multimodal fusion.
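The abstract does not specify the exact feature-extraction pipeline used for the M3 models, but MFCCs are a standard hand-engineered audio feature. As a rough illustration of what "traditional domain features" means here, the following is a minimal NumPy sketch of MFCC extraction (framing, power spectrum, mel filterbank, log compression, DCT-II); all parameter values (sample rate, FFT size, hop, filter counts) are illustrative defaults, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log-compressed mel filterbank energies.
    mel_energy = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # DCT-II decorrelates the log energies into cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), n + 0.5) / n_mels)
    return mel_energy @ dct.T  # shape: (n_frames, n_mfcc)

# Usage: one second of a 440 Hz tone at 16 kHz.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # (61, 13)
```

A frame-by-coefficient matrix like this is exactly the kind of compact, low-dimensional input a small CNN can learn from on limited data, in contrast to the raw-input representations Transformer models typically consume.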