104 E. M. Tronci et al.

Table 1  Results for the different classification tasks considering the VoxCeleb dataset with and without noise augmentation with the MUSAN dataset, for case A and case B

                                      VoxCeleb unaugmented       VoxCeleb noise augmented
Task                                  EER [%]   Accuracy [%]     EER [%]   Accuracy [%]
Binary classification (case A)        3.08      97.15            3.08      96.77
Multiclass classification (case A)    3.85      96.02            4.23      95.64
Binary classification (case B)        23.13     76.90            21.06     78.99
Multiclass classification (case B)    18.32     81.44            17.45     82.47

[Fig. 3: DET curves (Miss Probability [%] vs. False Alarm Probability [%]), with EER markers, for the binary and multiclass tasks, considering the dataset in case A: (a) unaugmented VoxCeleb dataset; (b) noise augmented VoxCeleb dataset]

Structural Binary and Multiclass Classification for Case A

The model trained on the dataset for case A performs with high accuracy in both the binary and multiclass classification tasks. In this configuration, only the undamaged condition and the first two damaged conditions are considered in training. In the binary classification task (Fig. 3), the model achieves an EER of 3.08% (LDA dimension = 3) with both the unaugmented VoxCeleb dataset and its noise-augmented version. In the multiclass classification task (Fig. 3), the model achieves EERs of 3.85% and 4.23% (LDA dimension = 3) with the unaugmented and noise-augmented VoxCeleb datasets, respectively.
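The reported EERs correspond to the operating point on the DET curve where the miss probability equals the false-alarm probability. A minimal sketch of how such a score-level EER can be computed is given below; the function name and the threshold-sweep approach are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal Error Rate: the point where the miss probability equals
    the false-alarm probability (the EER marker on a DET curve)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)          # sweep the threshold upward
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Below each threshold: targets become misses; above it:
    # remaining nontargets become false alarms.
    miss = np.cumsum(labels) / n_target
    fa = 1.0 - np.cumsum(1.0 - labels) / n_nontarget
    idx = np.argmin(np.abs(miss - fa))  # closest crossing of the two rates
    return 0.5 * (miss[idx] + fa[idx])
```

With perfectly separated score distributions the EER is 0; heavily overlapping distributions push it toward 50%, which is why case B (subtler damage discrimination, discussed below) yields much higher EERs than case A.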
Evidently, for these classification tasks associated with the dataset in case A, the addition of noise in the audio domain does not improve performance. Figure 4 presents the LDA-transformed x-vectors for the three LDA dimensions in the augmented VoxCeleb dataset. As expected, the damaged and undamaged scenarios are correctly separated in the binary classification, and even when the discrimination becomes more granular in the multiclass case, the two damage classes are correctly classified independently. The first and second LDA dimensions, which are associated with the largest eigenvalues, play a key role in separating the x-vector features of the two classes.

Structural Binary and Multiclass Classification for Case B

The binary classification task implemented for the dataset in case B leads to an EER of 23.13% (LDA dimension = 5) with the unaugmented VoxCeleb dataset and an EER of 21.06% (LDA dimension = 6) when the MUSAN dataset is used to augment it (Fig. 5). These results demonstrate the classification capability achievable by training the TDNN architecture on the audio source. Figure 6 presents the LDA-transformed