Performance Analysis of Audio Event Classification Using Deep Features under Adverse Acoustic Conditions
Audio event classification has traditionally been performed by extracting standard features based on human perception, such as Mel-frequency cepstral coefficients (MFCCs). In recent years, however, the trend has shifted towards deep features, which are extracted from the responses to complex input patterns learned within deep neural networks. These have generally been shown to outperform hand-crafted features. In fact, deep features are known to provide good generalization properties for classifying events not seen during training, and they can even be extracted from raw audio data. Since the captured audio is highly dependent on the acoustic properties of the auditory scene, it is important to assess the impact that adverse acoustic conditions have on the final classification performance. In this paper, we analyze the robustness of deep features under controlled acoustic conditions by simulating different degrees of background noise, reverberation, and segmentation errors, as well as in a real-life scenario where more than one audio event can be present at the same time. Results show a marked degradation in performance under background noise and segmentation errors, which suggests room for improvement in terms of robustness to different scenarios.
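As a rough illustration of the adverse conditions mentioned above, the following Python sketch shows one plausible way to simulate them: additive background noise at a target SNR, reverberation via convolution with a room impulse response (RIR), and a segmentation error modeled as a shift of the annotated event boundaries. The function names, inputs, and parameter choices are assumptions for illustration only and are not taken from the paper's actual experimental pipeline.

```python
# Hypothetical sketch of the three degradation types; not the authors' pipeline.
import numpy as np
from scipy.signal import fftconvolve

def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clean.shape)            # loop/trim noise to signal length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

def add_reverb(signal, rir):
    """Simulate reverberation by convolving the signal with a (synthetic or measured) RIR."""
    wet = fftconvolve(signal, rir)[: len(signal)]
    return wet / (np.max(np.abs(wet)) + 1e-12)       # renormalize to avoid clipping

def shift_segment(signal, onset_s, offset_s, error_s, sr):
    """Model a segmentation error by shifting both event boundaries by `error_s` seconds."""
    start = max(0, int((onset_s + error_s) * sr))
    end = min(len(signal), int((offset_s + error_s) * sr))
    return signal[start:end]
```

Under this kind of setup, the degraded segments would then be fed to the same feature extractors (MFCCs or deep features) and classifier used on the clean data, so that any drop in accuracy can be attributed to the simulated condition.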