Speech emotion recognition using wavelet packet reconstruction with attention-based deep recurrent neutral networks

Title:: Speech emotion recognition using wavelet packet reconstruction with attention-based deep recurrent neutral networks
Authors:: Meng, Hao
Yan, Tianhao
Wei, Hongwei
Ji, Xun
Links:: https://bibliotekanauki.pl/articles/2090711.pdf
Publication date:: 2021
Publisher:: Polska Akademia Nauk. Czytelnia Czasopism PAN
Subjects:: speech emotion recognition
voice activity detection
wavelet packet reconstruction
feature extraction
LSTM networks
attention mechanism
Source:: Bulletin of the Polish Academy of Sciences. Technical Sciences; 2021, 69, 1; e136300, 1-12
ISSN 0239-7528
Language:: English
Rights:: CC BY-NC-ND: Creative Commons Attribution - NonCommercial - NoDerivatives 4.0
Content provider:: Biblioteka Nauki
Type:: Article

Speech emotion recognition (SER) is a challenging task in human-computer interaction because it is difficult to find a feature set that fully discriminates between emotional states. Low-level descriptors such as short-time energy, fundamental frequency, formants, and mel-frequency cepstral coefficients (MFCC) are conventionally extracted by applying the FFT to the raw signal. However, these features are built in the frequency domain and ignore information from the temporal domain. In this paper, we propose a novel framework that combines a multi-layer wavelet sequence set obtained by wavelet packet reconstruction (WPR) with a conventional feature set to form a mixed feature set, which is then classified by recurrent neural networks (RNN) with an attention mechanism. In addition, because silent frames degrade SER performance, we apply voice activity detection based on the autocorrelation function to eliminate emotionally irrelevant frames. We show that the proposed algorithm significantly outperforms the traditional feature set in predicting spontaneous emotional states on the IEMOCAP corpus and the EMODB database, in both speaker-independent and speaker-dependent experiments. Notably, we obtain accuracies of 62.52% and 77.57% in the speaker-independent (SI) experiments and 66.90% and 82.26% in the speaker-dependent (SD) experiments on the two datasets, respectively.
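
The abstract does not say how the autocorrelation-based voice activity detection is implemented. As a rough illustration only, the sketch below marks a frame as speech when its normalized autocorrelation has a strong peak at a lag in a plausible pitch range. The function names, the 25 ms / 10 ms framing, the 60-400 Hz pitch range, and the 0.3 threshold are all assumptions made for this example and are not taken from the paper.

```python
import math


def frame_signal(signal, frame_len, hop):
    """Split a list of samples into overlapping frames (incomplete tail dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def autocorr_peak(frame, min_lag, max_lag):
    """Largest autocorrelation value over lags min_lag..max_lag, divided by frame energy."""
    energy = sum(x * x for x in frame)
    if energy < 1e-12:  # silent frame: nothing to correlate
        return 0.0
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def vad_mask(signal, sample_rate, frame_ms=25, hop_ms=10,
             f_min=60.0, f_max=400.0, threshold=0.3):
    """Return one boolean per frame: True if the frame looks periodic (voiced).

    All default values are illustrative assumptions, not the paper's settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    min_lag = int(sample_rate / f_max)  # shortest pitch period, in samples
    max_lag = int(sample_rate / f_min)  # longest pitch period, in samples
    return [autocorr_peak(f, min_lag, max_lag) >= threshold
            for f in frame_signal(signal, frame_len, hop)]


if __name__ == "__main__":
    sr = 8000
    # Half a second of a 200 Hz tone followed by half a second of silence.
    tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)]
    mask = vad_mask(tone + [0.0] * (sr // 2), sr)
    print(sum(mask), "of", len(mask), "frames kept")
```

Frames for which the mask is False would be dropped before feature extraction. A practical detector would probably also need an energy floor and some smoothing across neighbouring frames, which this sketch leaves out.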
