Transfer Learning for Music Genre Classification

Modern music information retrieval systems provide high-level features (genre, instrument, mood, and so on) for convenient searching and recommendation. Among these music tags, genre is the most widely used in practice. Machine learning techniques can classify genres from raw music, but their final performance depends heavily on the features used. As a powerful learning algorithm, a deep neural network can extract useful features automatically and effectively, replacing time-consuming feature engineering. However, a deeper architecture needs more data for training, and in many cases there is not enough data to train a deep network. Transfer learning addresses this problem by pre-training the network on a similar task that has enough data and then fine-tuning the parameters of the pre-trained network on the target dataset. The Magnatagatune dataset is used to pre-train the proposed five-layer Recurrent Neural Network (RNN) with Gated Recurrent Units (GRU). To reduce the size of the network input, the scattering transform is used as preprocessing. The GTZAN dataset is then used as the target dataset for genre classification. Experimental results show that transfer learning achieves a higher average classification accuracy (95.8%) than the same deep RNN with randomly initialized parameters (93.5%). In addition, the deep RNN using transfer learning converges to its final accuracy faster than the one using random initialization.


Introduction
Music genre is important to many applications, such as music recommender systems and information retrieval. Automatic genre classification systems have been developed using machine learning techniques in recent years. Most of these systems can classify music genres from raw music content [1][2][3].
Mel-frequency cepstral coefficients (MFCCs) and mel-spectrograms are widely used in genre classification because they extract features from raw data for the learning process. However, genre classification benefits from features over long time scales (>500 ms), whereas MFCCs are efficient at time scales of around 25 ms, and enlarging the time scale of a mel-spectrogram causes information loss [4,5]. In contrast, the scattering transform recovers the lost information through wavelet decompositions while extracting long-time-scale features with lowpass filters [6,7].
Deep learning has achieved great success in different areas, for instance computer vision [8][9][10], speech recognition [11,12], and natural language processing [13,14]. These algorithms extract high-level features automatically, layer by layer, unlike traditional machine learning classifiers such as Support Vector Machines (SVM), Nearest Neighbors, and Decision Trees, which depend heavily on the result of feature extraction. Among the typical deep models, the Recurrent Neural Network (RNN) is widely used for sequential data because it is good at learning relationships through time [15]. To achieve good performance, however, a deep neural network needs a large amount of data. When the target dataset is too small, we can pre-train the deep neural network on a large dataset from the same or a similar domain, then replace the connections to the classifier according to the number of target classes and fine-tune the parameters of the pre-trained network. This process is called transfer learning [16]. In this paper, we use the Magnatagatune dataset [17] and the GTZAN dataset [18] as the large dataset and the target dataset, respectively. A 5-layer RNN with Gated Recurrent Units (GRU) [19] and a softmax classifier are used. Additionally, to reduce the size of the input to the deep RNN, we use the scattering transform as preprocessing.
The experimental results show that the proposed 5-layer RNN reaches a high accuracy when using transfer learning, while the same architecture with random initialization converges more slowly and to a lower accuracy.

Transfer Learning Process
The architecture of the proposed method is shown in Figure 1. The overall process consists of two parts. The first part trains the deep RNN on a large music dataset (Magnatagatune in this paper). The second part performs genre classification after fine-tuning the previously trained deep RNN on the target dataset (GTZAN in this paper). Specifically, the scattering transform is applied at the beginning of each part to reduce the size of the raw music data and to extract preliminary features for neural network training. The deep RNN, a 5-layer RNN with GRU and a softmax classifier, is first trained on the tagged music clips. Finally, we use the target genre classification dataset (GTZAN) to fine-tune the trained parameters of the RNN.
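The classifier replacement in the fine-tuning step can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the parameter container, the function names `init_model` and `transfer`, the input size of 40, and the initialization scale are all hypothetical. Only the sizes 512 (hidden units), 188 (Magnatagatune tags), and 10 (GTZAN genres) come from the paper.

```python
import numpy as np

def init_model(n_in, n_hidden, n_out, rng):
    """Hypothetical parameter container: one recurrent block plus an output layer."""
    return {
        "recurrent": rng.normal(0.0, 0.1, size=(n_in + n_hidden, n_hidden)),
        "head_W": rng.normal(0.0, 0.1, size=(n_hidden, n_out)),
        "head_b": np.zeros(n_out),
    }

def transfer(pretrained, n_target_classes, rng):
    """Copy the pre-trained recurrent weights and attach a fresh output layer."""
    n_hidden = pretrained["head_W"].shape[0]
    return {
        # Reused as the starting point for fine-tuning on the target dataset.
        "recurrent": pretrained["recurrent"].copy(),
        # The old 188-way output layer is discarded and re-initialized.
        "head_W": rng.normal(0.0, 0.1, size=(n_hidden, n_target_classes)),
        "head_b": np.zeros(n_target_classes),
    }

rng = np.random.default_rng(0)
# Source task: tagging with 188 tags (input size 40 is an assumed placeholder).
source_model = init_model(n_in=40, n_hidden=512, n_out=188, rng=rng)
# Target task: 10 genres; all parameters would then be fine-tuned on GTZAN.
target_model = transfer(source_model, n_target_classes=10, rng=rng)
```

Copying the recurrent weights rather than sharing them keeps the saved pre-trained model intact, so every cross-validation fold can start from the same initialization.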

Scattering transform
In genre classification, a signal representation that is invariant over a large time scale (>500 ms) is important. The widely used mel-spectrogram can enlarge the time scale, but doing so removes information that is crucial to genre classification.
For an audio signal $x$, the scattering transform is defined as $S_n x$, where $n$ denotes the order. The zeroth-order coefficient $S_0 x(t) = x \star \phi(t)$ is locally translation invariant because of the time averaging by the lowpass filter $\phi$, but the averaging loses high-frequency information, which can be retrieved by the wavelet modulus coefficients $|x \star \psi_{\lambda_1}(t)|$. To make the wavelet modulus coefficients invariant to translation, a time averaging is applied. The first layer of the scattering transform is defined as:
$$S_1 x(t, \lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t)$$
Andén [7] indicates that if the wavelet filter bank $\psi_{\lambda_1}$ has the same frequency resolution as the mel windows, then the $S_1 x$ coefficients approximate the mel filter-bank coefficients. The difference is that the information lost by the averaging can be recovered by applying a bank of higher-frequency wavelet filters $\psi_{\lambda_2}$, followed by a modulus, to the wavelet modulus coefficients. As in the previous step, a lowpass filter $\phi(t)$ makes the coefficients translation invariant. The second layer of the scattering transform is then defined as:
$$S_2 x(t, \lambda_1, \lambda_2) = \big| |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \big| \star \phi(t)$$
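The cascade of wavelet filtering, modulus, and lowpass averaging described above can be sketched in NumPy. This is an illustrative sketch under assumptions, not the filter design used in the paper: the Gaussian filter shapes, the quality factor `q`, the centre frequencies, the lowpass width `sigma_phi`, and all function names are hypothetical choices, and the toy input is a synthetic amplitude-modulated tone rather than music.

```python
import numpy as np

def lowpass(n, sigma):
    """Frequency response of a Gaussian lowpass filter phi (sigma in cycles/sample)."""
    f = np.fft.fftfreq(n)
    return np.exp(-0.5 * (f / sigma) ** 2)

def wavelet_bank(n, centers, q=4.0):
    """Analytic Gaussian band-pass filters psi_lambda, one per centre frequency."""
    f = np.fft.fftfreq(n)
    return [np.exp(-0.5 * ((f - c) / (c / q)) ** 2) for c in centers]

def conv(x, filt_hat):
    """Circular convolution of x with a filter given by its frequency response."""
    return np.fft.ifft(np.fft.fft(x) * filt_hat)

def scattering(x, centers1, centers2, sigma_phi):
    """Return (S0, S1, S2) for a 1-D signal x."""
    n = len(x)
    phi = lowpass(n, sigma_phi)
    bank1 = wavelet_bank(n, centers1)
    bank2 = wavelet_bank(n, centers2)
    s0 = conv(x, phi).real                    # S0 x = x * phi
    s1, s2 = [], []
    for psi1 in bank1:
        u1 = np.abs(conv(x, psi1))            # |x * psi_l1|
        s1.append(conv(u1, phi).real)         # S1 x = |x * psi_l1| * phi
        for psi2 in bank2:
            u2 = np.abs(conv(u1, psi2))       # ||x * psi_l1| * psi_l2|
            s2.append(conv(u2, phi).real)     # S2 x = ||x * psi_l1| * psi_l2| * phi
    return s0, np.array(s1), np.array(s2)

# Toy input: a carrier at 128/1024 cycles/sample, amplitude-modulated at 8/1024.
n = 1024
t = np.arange(n)
x = (1.0 + 0.5 * np.cos(2 * np.pi * 8 * t / n)) * np.cos(2 * np.pi * 128 * t / n)
s0, s1, s2 = scattering(x, centers1=[0.0625, 0.125, 0.25],
                        centers2=[8 / n, 32 / n], sigma_phi=0.002)
```

In this toy case the first-order band centred on the carrier dominates `s1`, while the slow amplitude modulation, which the lowpass filter averages out of `s1`, shows up in the second-order coefficients whose $\psi_{\lambda_2}$ filter is centred on the modulation frequency.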

Deep Recurrent Neural Network
RNNs are well suited to sequential data and are used in tasks such as speech recognition and natural language processing (NLP). An RNN can be described as a sequence of transitions from the previous state to the current state: in a classical RNN, the current hidden state is computed from the current input and the previous hidden state. To address the vanishing gradient problem of RNNs, Hochreiter [15] introduced a gated structure named Long Short-Term Memory (LSTM). The LSTM unit stores information in memory cells, which allows information from more timesteps to be remembered, and it can decide to forget, output, or change the stored memories. The GRU is a popular variant of the LSTM that is simpler yet still effective. It uses two gates, $z_t$ and $r_t$, to update the hidden state.
We use a 5-layer GRU network, constructed by stacking each hidden layer on top of the previous one, to improve the representational ability of our architecture. Additionally, the generalization of the proposed deep RNN is improved by applying dropout between layers [20].
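A common way to write the classical transition and the GRU update is given here for reference. The symbol names $W$, $U$, and $b$ are assumptions rather than the paper's own notation, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

$$h_t = \tanh(W x_t + U h_{t-1} + b)$$

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h), \qquad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

The sketch below runs this update in NumPy and stacks five layers with dropout between them. It is a forward pass only, with toy sizes and hypothetical function names. The `keep_prob` parameter is an assumed name for the dropout setting: the paper does not state whether its dropout value is a keep probability or a drop rate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_layer(n_in, n_hidden, rng, scale=0.1):
    """Input weights W, recurrent weights U, and bias b for gates z, r and candidate h."""
    return {
        name: (rng.normal(0.0, scale, (n_in, n_hidden)),
               rng.normal(0.0, scale, (n_hidden, n_hidden)),
               np.zeros(n_hidden))
        for name in ("z", "r", "h")
    }

def gru_step(p, x_t, h_prev):
    """One GRU transition (row-vector convention, so x @ W instead of W x)."""
    Wz, Uz, bz = p["z"]
    Wr, Ur, br = p["r"]
    Wh, Uh, bh = p["h"]
    z = sigmoid(x_t @ Wz + h_prev @ Uz + bz)               # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)               # reset gate
    h_cand = np.tanh(x_t @ Wh + (r * h_prev) @ Uh + bh)    # candidate state
    return z * h_prev + (1.0 - z) * h_cand

def stacked_gru(layers, xs, keep_prob=1.0, rng=None):
    """Feed a sequence xs of shape (T, n_in) through stacked GRU layers."""
    seq = xs
    for i, p in enumerate(layers):
        if i > 0 and rng is not None and keep_prob < 1.0:
            # Inverted dropout on the connection between consecutive layers.
            mask = rng.random(seq.shape) < keep_prob
            seq = seq * mask / keep_prob
        h = np.zeros(p["z"][1].shape[0])
        states = []
        for x_t in seq:
            h = gru_step(p, x_t, h)
            states.append(h)
        seq = np.array(states)
    return seq

rng = np.random.default_rng(0)
sizes = [8] + [16] * 5     # toy sizes; the paper uses 512 hidden units per layer
layers = [init_gru_layer(sizes[i], sizes[i + 1], rng) for i in range(5)]
states = stacked_gru(layers, rng.normal(size=(20, 8)))
```

Passing `rng=None` or `keep_prob=1.0` disables dropout, which corresponds to using the network at test time.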

Datasets and Experiment Setup
The Magnatagatune and GTZAN datasets are used as the large dataset and the target dataset, respectively. All clips are converted to mono and sampled at 16 kHz. Magnatagatune has 25863 clips, each annotated using 188 different musical tags such as genre, mood, and instrument. We use the last 2105 clips (those in folder 'f') for validation and the others for training. Each layer has 512 hidden units, dropout is set to 0.7, and the learning rate is 0.00001. Because the tags are imbalanced, we evaluate the model with the AUC-ROC score [21]. When the AUC-ROC score is stable, we stop the training and save the model.
The GTZAN dataset has 1000 clips in 10 genres, with 100 clips per genre. As the target dataset, it is randomly shuffled, and the mean accuracy over 10 runs of 10-fold cross-validation is reported as the final test accuracy. In each round, 1 of the 10 folds is used for testing and the others for training. For each run of 10-fold cross-validation, we change the number of outputs of the softmax classifier to 10 (the number of genres in GTZAN) and then fine-tune the parameters of the model pre-trained on Magnatagatune.
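The evaluation protocol on GTZAN, 10 runs of shuffled 10-fold cross-validation averaged into one score, can be sketched with the standard library alone. The function names and the `evaluate` callback are hypothetical; in a real experiment the callback would load the pre-trained model, replace the classifier, fine-tune on the training indices, and return the test accuracy. The paper does not say whether folds are stratified by genre, so this sketch uses a plain random shuffle.

```python
import random
from statistics import mean

def kfold_indices(n_items, n_folds, rng):
    """Shuffle item indices and yield (train, test) index lists for each fold."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def repeated_cv(n_items, evaluate, n_folds=10, n_repeats=10, seed=0):
    """Mean score over n_repeats independent runs of n_folds-fold cross-validation."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_repeats):
        fold_scores = [evaluate(train, test)
                       for train, test in kfold_indices(n_items, n_folds, rng)]
        run_scores.append(mean(fold_scores))
    return mean(run_scores)

def dummy_evaluate(train_idx, test_idx):
    """Placeholder: a real version would fine-tune on train_idx and return test accuracy."""
    return len(test_idx) / (len(train_idx) + len(test_idx))

# 1000 clips as in GTZAN; each fold holds out 100 clips and trains on 900.
score = repeated_cv(1000, dummy_evaluate)
```

Reshuffling before every run means the 10 runs use different partitions, so the averaged accuracy is less sensitive to one particular split.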

Experiment Results and Analysis
As shown in Fig. 3, both the randomly initialized and the transfer learning models (the pre-training process is shown in Fig. 2) of the 5-layer RNN with GRU and scattering transform preprocessing converge to a high training accuracy. The models using transfer learning need about 100 epochs to become stable, whereas the randomly initialized models need more. This behavior appears not only in the three randomly picked training processes shown, but also in those not shown. It indicates that transfer learning provides a better initialization of the model and speeds up convergence.
Compared with other recent work listed in Table 1, our approach shows a competitive accuracy (95.8%) for genre classification on the GTZAN dataset. Even the model using random initialization reaches a relatively high accuracy (93.5%). These results show that the combination of the scattering transform and a deep RNN performs well in music genre classification.

Conclusion
In this paper, we apply transfer learning to music genre classification using a 5-layer RNN with GRU and scattering coefficients as its input. With transfer learning from a large music dataset (Magnatagatune in this paper), our model converges faster and reaches a higher average accuracy on the target dataset (GTZAN in this paper) than the same model with random initialization. The accuracy of the transfer learning approach is also competitive with state-of-the-art models. The effectiveness of a deep RNN combined with the scattering transform and transfer learning has thus been verified for music genre classification.

Fig. 1. The architecture of the proposed transfer learning process.

Fig. 3. Three randomly picked training processes from the 10-fold cross-validation. Blue lines represent the RNN using random initialization, and orange lines represent the RNN using transfer learning. The accuracy is measured on a random batch of training data.

Table 1. Average test accuracy of different models on the GTZAN dataset.