In this work, we propose an approach that features deep feature embedding
learning and hierarchical classification with a triplet loss function for
Acoustic Scene Classification (ASC). On the one hand, a deep convolutional
neural network is first trained to learn a feature embedding from scene audio
signals. The trained network then maps an input into the embedding space,
transforming it into a high-level feature vector that represents it. On the
other hand, in order to
exploit the structure of the scene categories, the original scene
classification problem is structured into a hierarchy where similar categories
are grouped into meta-categories. Hierarchical classification is then
accomplished using deep neural network classifiers with a triplet loss
function. Our experiments show that the proposed system achieves good
performance on both the DCASE 2018 Task 1A and 1B datasets, with absolute
accuracy gains of 15.6% and 16.6% over the DCASE 2018 baseline on Tasks 1A
and 1B, respectively.
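
For readers unfamiliar with the triplet loss mentioned above, the following is a minimal sketch of its common hinge form for a single (anchor, positive, negative) triplet of embedding vectors. The squared Euclidean distance, the margin value of 1.0, and the function names are assumptions made for illustration; the abstract does not state which distance, margin, or triplet selection strategy the proposed system uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-form triplet loss for one triplet of embedding vectors.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed, illustrative value).
    """
    d_pos = squared_distance(anchor, positive)  # same-class pair
    d_neg = squared_distance(anchor, negative)  # different-class pair
    return max(0.0, d_pos - d_neg + margin)


# A well-separated triplet incurs no loss; a violating one is penalized.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
hard = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
```

Here `easy` evaluates to zero because the negative already lies beyond the margin, while `hard` is positive because the negative is closer to the anchor than the positive.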