Abstract:
Since a cover version may differ from the original in various respects, such as timbre, tempo, structure, key, arrangement, and even the language of the vocals, automatically identifying all cover versions of a given original is a challenging task. Most conventional cover song identification (CSI) schemes adopt hand-crafted features, which are highly customizable and effective. However, their shallow processing strategy and linear mapping cannot precisely describe the complex dynamic characteristics contained in music. To deal with this problem, deep-learning architectures have recently been introduced into some music feature extraction algorithms, with good results. However, the performance of deep-learning based schemes depends heavily on the size of the training set, so they easily fall into a local optimum. In this paper, after analyzing experimentally the complementarity between the hand-crafted feature and the deep-learning feature, we propose a feature fusion model. Firstly, a deep-learning model is trained to extract the deep pitch class profile (DPCP) feature. Meanwhile, a hand-crafted model is utilized to extract the main melody (MLD) feature. Then, the DPCP-based and MLD-based similarity scores are calculated via Dmax and used to construct a similarity function. Furthermore, the two similarity scores are used to construct a similarity vector, from which an improved support vector machine (SVM) obtains the probability that the input track pair is a reference/cover pair. Finally, in terms of the receiver operating characteristic (ROC) curve and the area under the curve (AUC), the proposed model is compared with state-of-the-art CSI schemes based on a single feature and on multiple features, respectively.
Experimental results show that the proposed scheme outperforms CSI schemes based on the hand-crafted feature alone and on the deep-learning feature alone, and that it retains both the common and the complementary properties of the hand-crafted and deep-learning features.
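
As an illustration of the evaluation metric only, the AUC mentioned above can be computed directly from a list of similarity scores and ground-truth labels. The following is a minimal sketch, not the authors' evaluation code: the function name `auc` and the toy scores are hypothetical, and the labels assume 1 for a reference/cover pair and 0 for a non-cover pair.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (cover, non-cover) pairs in which the cover pair scores higher.
    Ties count as half a win (Mann-Whitney U formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up scores for four track pairs (not results from the paper).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))  # prints 0.75
```

An AUC of 1.0 means every cover pair is ranked above every non-cover pair, while 0.5 corresponds to a random ranking. The pairwise loop is quadratic in the number of pairs, which is fine for a sketch but would be replaced by a rank-based computation on large test sets.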