Dynamics of Civil Structures, Volume 2

E. M. Tronci et al.

Fig. 2 Power spectrum of an undamaged record for the first sensor. In red, the original frequency scale of the structural dataset; in blue, the modified scale obtained after the frequency-domain transformation.

In this work, the structural system presents records sampled at 400 Hz, while the VoxCeleb utterances are recorded at 16 kHz. To achieve equivalence in frequency content between the two domains, a transformation procedure is implemented that brings the structural frequency content into the same frequency range as the audio records (Fig. 2).

To build a richer audio dataset, both the audio and structural datasets are augmented. In the VoxCeleb dataset, the records are augmented with reverberation, noise, music, and babble using the MUSAN dataset. Once the corrupted records are created, they are combined with the original clean data so that the trained model can learn to discriminate between clean signals and signals affected by noise and disturbances. For the structural dataset, the original raw signal from the numerical simulation is corrupted with white Gaussian noise characterized by 10% RMS, and in each simulation a small perturbation of the mass and stiffness of each degree of freedom is introduced.

The dataset adopted in the training phase consists of two groups: the first collects the audio features used to train the first part of the model, and the second consists of records belonging to the structural system. The VoxCeleb dataset is used entirely for training, while only 80% of the structural records are adopted for training; the remaining 20% of the structural data are used to create the test set.

Feature Extraction

In the present work, the Mel Frequency Cepstral Coefficients (MFCCs) [12] are adopted as damage-sensitive features. For both the VoxCeleb audio domain and the structural target domain, which is properly transformed, the MFCC vector consists of 30 coefficients per frame.
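As a minimal sketch of the frequency-domain transformation described above, one simple reading is that the 400 Hz structural record is relabeled as if it had been sampled at 16 kHz, which stretches every frequency component by the ratio of the two rates (a factor of 40); the interpretation and the helper name are assumptions, not the authors' exact procedure:

```python
import numpy as np

SOURCE_FS = 400      # structural sampling rate [Hz]
TARGET_FS = 16_000   # audio sampling rate [Hz]
SCALE = TARGET_FS / SOURCE_FS  # every frequency is multiplied by 40

def dominant_frequency(x, fs):
    """Frequency [Hz] of the largest FFT peak of a real-valued signal."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(freqs[np.argmax(spec)])

# A 10 Hz structural tone, 4 s long at 400 Hz.
t = np.arange(4 * SOURCE_FS) / SOURCE_FS
x = np.sin(2 * np.pi * 10.0 * t)

# Same samples, two nominal rates: relabeling shifts 0-200 Hz content
# into the 0-8 kHz audio band without touching the data itself.
f_struct = dominant_frequency(x, SOURCE_FS)  # ~10 Hz
f_audio = dominant_frequency(x, TARGET_FS)   # ~400 Hz (10 Hz x 40)
```

Under this reading, the modified blue scale in Fig. 2 would simply be the original red scale multiplied by 40.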
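The 10% RMS Gaussian-noise corruption applied to the structural simulations can be sketched as follows; interpreting "10% RMS" as a noise RMS equal to 10% of the signal RMS, and the function name, are assumptions:

```python
import numpy as np

def add_noise_10pct_rms(signal, rng=None):
    """Corrupt a record with zero-mean white Gaussian noise whose RMS is
    10% of the signal RMS (one plausible reading of '10% RMS')."""
    rng = np.random.default_rng() if rng is None else rng
    rms = np.sqrt(np.mean(np.square(signal)))
    noise = rng.normal(0.0, 0.10 * rms, size=signal.shape)
    return signal + noise

# Example: corrupt a clean 10 Hz structural tone sampled at 400 Hz.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 10.0 * np.arange(1600) / 400)
noisy = add_noise_10pct_rms(clean, rng)
```

The per-simulation mass and stiffness perturbations mentioned in the text would be applied inside the numerical model itself, before this measurement-noise step.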
Besides the Mel Frequency Cepstral Coefficients, another set of features is considered and added to the feature vector used to train the classification model: pitch, delta-pitch, and probability-of-voicing features. Pitch is a perceived quantity related to the fundamental frequency of vibration of the system to which it refers. The final damage-sensitive feature vector is therefore 33-dimensional. The extraction process is applied to every record in the audio dataset, to the structural measurements in the training set, and to the structural records in the test set.

Training Phase

The features extracted in the previous step are prepared for the training process. They are randomly selected and assigned the proper class tags, creating the dataset that will be the input to a neural network model. This process is carried out for both the structural dataset and the audio dataset. Then, cepstral mean normalization is applied to the features to make them all zero-mean and remove the convolved noise within the signal. Additionally, for the audio dataset, any silence frames are removed.

In [16], the authors show that training a PLDA classifier on fixed-length embeddings extracted from the higher layers of a speaker recognition TDNN (which they refer to as "x-vectors") achieves superior performance on out-of-class speaker recognition. Ananthram et al. [17] successfully implement a similar strategy for automated emotion detection in speech. Following an equivalent approach, the classification task is implemented here, assuming that such a network learns dense representations of speech segments in its upper layers and that this abstract information can later be used to classify the structural health condition of the 12DOF system. The features derived from the audio domain are used to pre-train a TDNN architecture on a speaker recognition task [12].
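The 33-dimensional frame assembly and the cepstral mean normalization step described above can be sketched as follows; the placeholder feature values and helper name are illustrative, since in practice the MFCC and pitch features would come from a toolkit such as Kaldi:

```python
import numpy as np

def cepstral_mean_normalize(feats):
    """Per-utterance cepstral mean normalization (CMN): subtract each
    coefficient's mean over time, making the features zero-mean and
    removing stationary convolutional effects (channel / sensor response).

    feats: (n_frames, n_coeffs) matrix of frame-level features.
    """
    return feats - feats.mean(axis=0, keepdims=True)

# Assemble one utterance's 33-dim frames: 30 MFCCs plus pitch,
# delta-pitch, and probability of voicing (placeholder values here).
rng = np.random.default_rng(1)
mfcc = rng.normal(size=(100, 30))
pitch_feats = rng.normal(size=(100, 3))
frames = np.hstack([mfcc, pitch_feats])  # (100, 33) feature vectors
normed = cepstral_mean_normalize(frames)
```

After CMN, every coefficient averages to zero across the utterance, which is the property the training pipeline relies on.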
Here, the same 9-layer architecture and training methodology adopted in [16] are implemented, using the training script published as part of the Kaldi toolkit [18]. A Time-Delay Neural Network is a multilayer artificial neural network architecture able to capture an unknown system's dynamics by modeling a flexibly structured network that imitates the system by adaptively changing its parameters. This architecture maps a finite time sequence into a single output. Each layer of a TDNN processes a context window from the previous layer, which means that lower layers have a smaller receptive field and therefore model local features, while higher layers have a bigger receptive field and thus model long-term dependencies from the slice of
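The growing receptive field just described can be made concrete with a small calculation. The per-layer temporal contexts below are assumptions based on the frame-level layers of the x-vector TDNN reported in [16]; the point is only how the half-widths accumulate across layers:

```python
# Temporal context of each frame-level TDNN layer, as offsets relative
# to the current frame t (assumed values, following the x-vector setup).
layer_contexts = [
    [-2, -1, 0, 1, 2],  # layer 1: frames t-2 .. t+2
    [-2, 0, 2],         # layer 2: strided context
    [-3, 0, 3],         # layer 3: wider strided context
    [0],                # layer 4: no added context
    [0],                # layer 5: no added context
]

def total_receptive_field(contexts):
    """Number of input frames seen by one output frame: the per-layer
    context extents add up, so deeper layers cover longer time spans."""
    left = sum(min(c) for c in contexts)
    right = sum(max(c) for c in contexts)
    return right - left + 1

rf = total_receptive_field(layer_contexts)  # 15 frames of input context
```

With these contexts, layer 1 sees only 5 frames while the top frame-level layer effectively sees 15, which is exactly the local-versus-long-term split the text describes.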
