Comparison of environmental sound classification performance of convolutional neural networks according to audio preprocessing methods

Wongeun Oh

doi:10.7776/ASK.2020.39.3.143

All Issue

2020 Vol.39, Issue 3 Next Page

Research Article

Comparison of environmental sound classification performance of convolutional neural networks according to audio preprocessing methods 오디오 전처리 방법에 따른 콘벌루션 신경망의 환경음 분류 성능 비교

31 May 2020. pp. 143-149

PDF XML

Abstract

This paper presents the effect of the feature extraction methods used in the audio preprocessing on the classification performance of the Convolutional Neural Networks (CNN). We extract mel spectrogram, log mel spectrogram, Mel Frequency Cepstral Coefficient (MFCC), and delta MFCC from the UrbanSound8K dataset, which is widely used in environmental sound classification studies. Then we scale the data to 3 distributions. Using the data, we test four CNNs, VGG16, and MobileNetV2 networks for performance assessment according to the audio features and scaling. The highest recognition rate is achieved when using the unscaled log mel spectrum as the audio features. Although this result is not appropriate for all audio recognition problems but is useful for classifying the environmental sounds included in the Urbansound8K.

Keywords

Environmental sound classification

Convolutional neural networks

Audio feature extraction

Audio preprocessing

본 논문에서는 딥러닝(deep learning)을 이용하여 환경음 분류 시 전처리 단계에서 사용하는 특징 추출 방법이 콘볼루션 신경망의 분류 성능에 미치는 영향에 대해서 다루었다. 이를 위해 환경음 분류 연구에서 많이 사용되는 UrbanSound8K 데이터셋에서 멜 스펙트로그램(mel spectrogram), 로그 멜 스펙트로그램(log mel spectrogram), Mel Frequency Cepstral Coefficient(MFCC), 그리고 delta MFCC를 추출하고 각각을 3가지 분포로 스케일링하였다. 이 데이터를 이용하여 4 종의 콘볼루션 신경망과 이미지넷에서 좋은 성능을 보였던 VGG16과 MobileNetV2 신경망을 학습시킨 다음 오디오 특징과 스케일링 방법에 따른 인식률을 구하였다. 그 결과 인식률은 스케일링하지 않은 로그 멜 스펙트럼을 사용했을 때 가장 우수한 것으로 나타났다. 도출된 결과를 모든 오디오 인식 문제로 일반화하기는 힘들지만, Urbansound8K의 환경음이 포함된 오디오를 분류할 때는 유용하게 적용될 수 있을 것이다.

키워드

환경음 분류

콘볼루션 신경망

오디오 특징 추출

오디오 전처리

References

K. J. Piczak, "Environmental sound classification with convolutional neural networks," Proc. IEEE 25th International Workshop on Machine Learning for Signal Processing, 1-6 (2015).

10.1109/MLSP.2015.7324337

Y. Tokozume and T. Harada, "Learning environmental sounds with end-to-end convolutional neural network," Proc. 2017 IEEE ICASSP. 2721-2725 (2017).

10.1109/ICASSP.2017.7952651

V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," Procedia Comput. Sci. 112, 2048-2056 (2017).

10.1016/j.procs.2017.08.250

Y. Su, K. Zhang, J. Wang, and K. Madani, "Environment sound classification using a two-stream CNN based on decision-level fusion," Sensors, 19, 1733 (2019).

10.3390/s1907173330978974PMC6479959

J. Lee, W. Kim, and K. Lee, "Convolutional neural network based traffic sound classification robust to environmental noise" (in Korean), J. Acoust. Soc. Kr. 37, 469-474 (2018).

K. Ko, S. Park, and H. Ko, "Convolutional neural network based amphibian sound classification using covariance and modulogram" (in Korean), J. Acoust. Soc. Kr. 37, 60-65 (2018).

W. Oh, "Audio classification performance of CNN according to audio feature extraction methods" (in Korean), Proc. J. Acoust. Soc. Kr. Supple.2(s) 38, 64 (2019).

J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," Proc. of the 22nd ACM International Conf. on Multimedia, 1041- 1044 (2014).

10.1145/2647868.2655045

J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Process. Lett. 24, 279-283 (2017).

10.1109/LSP.2017.2657381

B. McFee, C. Raffel, D. Liang, D. Ellis, M. Mcvicar, E. Battenberg, and O. Nieto, "Librosa: Audio and music signal analysis in python," Proc. 14th Python Sci. Conf. 18-24 (2015).

10.25080/Majora-7b98e3ed-003

D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014).

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. F. -Fei, "ImageNet large scale visual recognition challenge," Int. J. Computer Vision, 115, 211-252 (2015).

10.1007/s11263-015-0816-y

K. Simonyan and A. Zisseman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2015).

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 4510-4520 (2018).

10.1109/CVPR.2018.00474

Information

Publisher :The Acoustical Society of Korea
Publisher(Ko) :한국음향학회
Journal Title :The Journal of the Acoustical Society of Korea
Journal Title(Ko) :한국음향학회지
Volume : 39
No :3
Pages :143-149
Received Date : 2020-02-25
Revised Date : 2020-04-16
Accepted Date : 2020-04-22
DOI :https://doi.org/10.7776/ASK.2020.39.3.143

The Journal of the Acoustical Society of KoreaISSN:1225-4428(Print) 2287-3775(Online)한국음향학회

All Issue