
Noise-resistant multimodal transformer for emotion recognition

Liu, Yuanyuan
Zhang, Haoyu
Zhan, Yibing
Chen, Zijing
Yin, Guanghao
Wei, Lin
Chen, Zhe
Abstract
Multimodal emotion recognition identifies human emotions from various data modalities such as video, text, and audio. However, this task is easily affected by noisy information that carries no useful semantics and may occur at different locations in a multimodal input sequence. To this end, we present a novel paradigm that extracts noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information. Our new pipeline, the Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion Transformer for the multimodal emotion recognition task. In particular, the NRGF extractor learns to provide a generic, disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a multimodal fusion Transformer to incorporate the Multimodal Features (MFs) of multimodal inputs (serving as the key and value) based on their relations to the NRGF (serving as the query). The possibly insensitive but useful information of the NRGF can thus be complemented by MFs that contain more details, achieving more accurate emotion understanding while maintaining robustness against noise. To train NORM-TR properly, our proposed noise-aware learning scheme complements the normal emotion recognition losses by enhancing learning against noise. The scheme explicitly adds noise to either all modalities or a specific modality at random locations of a multimodal input sequence. We correspondingly introduce two adversarial losses that encourage the NRGF extractor to learn NRGFs invariant to the added noise, thus helping NORM-TR achieve more favorable multimodal emotion recognition performance.
Extensive experiments demonstrate the effectiveness of NORM-TR and the noise-aware learning scheme in dealing with both explicitly added noisy information and normal multimodal sequences containing implicit noise. On several popular multimodal datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, demonstrating that the ability to resist noisy information in multimodal input is important for effective emotion recognition.
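The query/key/value arrangement described in the abstract (NRGF as query, MFs as key and value) follows standard scaled dot-product cross-attention. The following is a minimal NumPy sketch of that mechanism only, not the authors' implementation; learned projection matrices, multi-head structure, and all function names here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nrgf_cross_attention(nrgf, mfs):
    """Cross-attention where the noise-resistant generic feature (NRGF)
    serves as the query and the multimodal features (MFs) serve as key
    and value. Hypothetical sketch: real models apply learned Q/K/V
    projections and multiple heads.

    nrgf: (n_q, d) query features
    mfs:  (n_kv, d) concatenated multimodal features
    """
    d = nrgf.shape[-1]
    # Relation of each MF token to the NRGF query tokens.
    scores = nrgf @ mfs.T / np.sqrt(d)        # (n_q, n_kv)
    weights = softmax(scores, axis=-1)        # rows sum to 1
    # Detail-rich MFs complement the noise-insensitive NRGF.
    return weights @ mfs                      # (n_q, d)
```

Because the attention weights in each row sum to one, the output for each NRGF query is a convex combination of the MF tokens, weighted by their relevance to the noise-resistant representation.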
Keywords
multimodal, emotion recognition, transformer, noise-resistant generic feature, noise-aware learning scheme
Date
2025
Type
Journal article
Journal
International Journal of Computer Vision
Page Range
3020-3040
ACU Department
Peter Faber Business School
Faculty of Law and Business
License
All rights reserved
File Access
Controlled
Notes
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024