P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," Proc. IEEE ICASSP, 629-632 (1996).
Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust. Speech Signal Process. 32, 1109-1121 (1984). 10.1109/TASSP.1984.1164453
N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Trans. Audio, Speech Lang. Process. 21, 2140-2151 (2013). 10.1109/TASL.2013.2270369
Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio, Speech Lang. Process. 23, 7-19 (2015). 10.1109/TASLP.2014.2364452
S. R. Park and J. W. Lee, "A fully convolutional neural network for speech enhancement," Proc. Interspeech, 1993-1997 (2017). 10.21437/Interspeech.2017-1465
A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," Proc. Interspeech, 22-25 (2012).
X. Feng, Y. Zhang, and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," Proc. IEEE ICASSP, 1759-1763 (2014). 10.1109/ICASSP.2014.6853900
B. Li and K. C. Sim, "A spectral masking approach to noise-robust speech recognition using deep neural networks," IEEE/ACM Trans. Audio, Speech Lang. Process. 22, 1296-1305 (2014). 10.1109/TASLP.2014.2329237
D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech Lang. Process. 26, 1702-1726 (2018). 10.1109/TASLP.2018.2842159
D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," arXiv preprint arXiv:1703.10717 (2017).
S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," Proc. IEEE CVPR, 1526-1535 (2018). 10.1109/CVPR.2018.00165
L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," Proc. Thirty-First AAAI Conference on Artificial Intelligence, 2852-2858 (2017).
S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," Proc. Interspeech, 3642-3646 (2017). 10.21437/Interspeech.2017-1428
A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," Proc. IEEE ICASSP, 5414-5418 (2018). 10.1109/ICASSP.2018.8462614
C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," Proc. IEEE ICASSP, 5024-5028 (2018). 10.1109/ICASSP.2018.8462581
W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," Proc. IEEE ICASSP, 4960-4964 (2016). 10.1109/ICASSP.2016.7472621
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499 (2016).
D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," Proc. Interspeech, 2008-2012 (2017). 10.21437/Interspeech.2017-1620
M. Mimura, S. Sakai, and T. Kawahara, "Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks," Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 134-140 (2017). 10.1109/ASRU.2017.8268927
H. Zhang, C. Liu, N. Inoue, and K. Shinoda, "Multi-task autoencoder for noise-robust speech recognition," Proc. IEEE ICASSP, 5599-5603 (2018). 10.1109/ICASSP.2018.8461446
K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," Proc. IEEE ICCV, 1026-1034 (2015). 10.1109/ICCV.2015.123
M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton, "On rectified linear units for speech processing," Proc. IEEE ICASSP, 3517-3521 (2013).
X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Proc. Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 249-256 (2010).
S. X. Wen, J. Du, and C.-H. Lee, "On generating mixing noise signals with basis functions for simulating noisy speech and learning DNN-based speech enhancement models," Proc. IEEE International Workshop on MLSP, 1-6 (2017). 10.1109/MLSP.2017.8168192
ITU-T Rec. P.56, Objective Measurement of Active Speech Level (2011).
X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," Proc. Interspeech, 436-440 (2013).
R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," Proc. 30th ICML, 2347-2355 (2013).
- Publisher: The Acoustical Society of Korea
- Publisher (Ko): 한국음향학회
- Journal Title: The Journal of the Acoustical Society of Korea
- Journal Title (Ko): 한국음향학회지
- Volume: 38
- No.: 6
- Pages: 670-677
- Received Date: 2019. 10. 22
- Accepted Date: 2019. 11. 11