Noise-resistant multimodal transformer for emotion recognition

Journal article


Liu, Yuanyuan, Zhang, Haoyu, Zhan, Yibing, Chen, Zijing, Yin, Guanghao, Wei, Lin and Chen, Zhe. (2025). Noise-resistant multimodal transformer for emotion recognition. International Journal of Computer Vision. pp. 3020-3040. https://doi.org/10.1007/s11263-024-02304-3
Authors: Liu, Yuanyuan, Zhang, Haoyu, Zhan, Yibing, Chen, Zijing, Yin, Guanghao, Wei, Lin and Chen, Zhe
Abstract

Multimodal emotion recognition identifies human emotions from multiple data modalities such as video, text, and audio. However, this task is easily affected by noisy information that carries no useful semantics and may occur at different locations in a multimodal input sequence. To address this, we present a novel paradigm that extracts noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information. Our new pipeline, the Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion Transformer for the multimodal emotion recognition task. In particular, the NRGF extractor learns to provide a generic, disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a multimodal fusion Transformer to incorporate the Multimodal Features (MFs) of the inputs (serving as keys and values) based on their relations to the NRGF (serving as the query). The potentially insensitive but useful information in the NRGF can thus be complemented by the MFs, which contain more detail, achieving more accurate emotion understanding while maintaining robustness against noise. To train NORM-TR properly, our proposed noise-aware learning scheme complements the normal emotion recognition losses by enhancing learning against noise. The scheme explicitly adds noise to either all modalities or a specific modality at random locations of a multimodal input sequence, and we correspondingly introduce two adversarial losses that encourage the NRGF extractor to produce NRGFs invariant to the added noise, facilitating more favorable multimodal emotion recognition performance. Extensive experiments demonstrate the effectiveness of NORM-TR and the noise-aware learning scheme in handling both explicitly added noise and normal multimodal sequences with implicit noise. On several popular multimodal datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, demonstrating that the ability to resist noisy information in multimodal input is important for effective emotion recognition.
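As a concrete illustration of the fusion step described in the abstract, the sketch below shows the attention pattern in which the NRGF serves as the query and the multimodal features serve as keys and values. The module name, dimensions, single-layer structure, and residual design are illustrative assumptions for this record, not the authors' released implementation.

```python
# Minimal PyTorch sketch of NRGF-as-query cross-attention fusion.
# All names and sizes (NRGFFusion, dim=256, num_heads=4) are assumptions,
# not the exact architecture from the NORM-TR paper.
import torch
import torch.nn as nn

class NRGFFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nrgf, mfs):
        # nrgf: (batch, q_len, dim)  noise-resistant generic features (query)
        # mfs:  (batch, kv_len, dim) e.g. concatenated video/text/audio
        #                            features (keys and values)
        fused, _ = self.attn(query=nrgf, key=mfs, value=mfs)
        # Residual connection keeps the robust NRGF signal while letting
        # the more detailed MFs complement it.
        return self.norm(nrgf + fused)

# Example: fuse a 10-token NRGF with a 60-token multimodal sequence.
nrgf = torch.randn(2, 10, 256)
mfs = torch.randn(2, 60, 256)
out = NRGFFusion()(nrgf, mfs)
print(out.shape)  # torch.Size([2, 10, 256])
```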

Keywords: multimodal; emotion recognition; transformer; noise-resistant generic feature; noise-aware learning scheme
Year: 2025
Journal: International Journal of Computer Vision
Journal citation: pp. 3020-3040
Publisher: Springer
ISSN: 0920-5691
Digital Object Identifier (DOI): https://doi.org/10.1007/s11263-024-02304-3
Scopus EID: 2-s2.0-105003158610
Page range: 3020-3040
Funder: National Natural Science Foundation of China (NSFC)
Natural Science Foundation of Hubei Province
Major Science and Technology Projects in Yunnan Province
Publisher's version
License: All rights reserved
File Access Level: Controlled
Output status: In press
Publication process dates
Deposited: 16 Jun 2025
Grant ID: 62076227, 62002090, 2023AFB57, 202202AD080007
Additional information

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024

Permalink: https://acuresearchbank.acu.edu.au/item/91z31/noise-resistant-multimodal-transformer-for-emotion-recognition

Restricted files

Publisher's version



Related outputs

Dynamically modulated mask sparse tracking
Chen, Zijing, You, Xinge, Zhong, Boxuan, Li, Jun and Tao, Dacheng. (2017). Dynamically modulated mask sparse tracking. IEEE Transactions on Cybernetics. 47(11), pp. 3706-3718. https://doi.org/10.1109/TCYB.2016.2577718