Determination of representative emotional style of speech based on k-means algorithm

Sangshin Oh; Se-Yun Um; Inseon Jang; Chung Hyun Ahn; Hong-Goo Kang

doi:10.7776/ASK.2019.38.5.614


2019, Vol. 38, Issue 5

Research Article

Korean title: k-평균 알고리즘을 활용한 음성의 대표 감정 스타일 결정 방법

30 September 2019. pp. 614-620


Abstract

In this paper, we propose a method to effectively determine the representative style embedding of each emotion class in order to improve a global style token-based end-to-end speech synthesis system. The emotional expressiveness of the conventional approach was limited because it used only one representative style per emotion. We overcome this problem by extracting multiple representatives per emotion using a k-means clustering algorithm. Listening test results show that the proposed method expresses each emotion clearly while distinguishing it from the others.
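The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the style embeddings have already been extracted and grouped by emotion label as NumPy arrays, and the function names (`assign`, `kmeans`, `representative_styles`), the dictionary layout, and the choice of `k` are all hypothetical.

```python
import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(points, centroids)


def representative_styles(embeddings_by_emotion, k):
    """Map each emotion label to its k centroid style embeddings, shape (k, dim).

    Assumption: embeddings_by_emotion is a dict {emotion: array of shape (n, dim)}.
    """
    return {
        emotion: kmeans(np.asarray(emb, dtype=float), k)[0]
        for emotion, emb in embeddings_by_emotion.items()
    }
```

Under these assumptions, `representative_styles({"happy": happy_embeddings, "sad": sad_embeddings}, k=3)` would return three candidate style vectors per emotion instead of one. How a centroid is then chosen at synthesis time is not stated in the abstract and is left open here.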

Keywords

Speech synthesis

End-to-end speech synthesis

Emotional speech synthesis

Style token

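(Translated from the Korean abstract) This paper proposes a method for effectively determining the style vector of each emotion in order to improve the performance of an end-to-end emotional speech synthesis system that uses global style tokens (GST). The conventional method uses only a single representative value to express each emotion, which greatly limits the richness of emotional expression. To address this, we propose extracting multiple representative styles using the k-means algorithm. Listening tests showed that the representative styles extracted for each emotion with the proposed method express emotion more strongly than those of the conventional method and make the differences between emotions clearly distinguishable.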



Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 38
No: 5
Pages: 614-620
Received Date: 2019-07-16
Accepted Date: 2019-09-04
DOI: https://doi.org/10.7776/ASK.2019.38.5.614

The Journal of the Acoustical Society of Korea. ISSN: 1225-4428 (Print), 2287-3775 (Online). 한국음향학회
