Abstract:
Music source separation aims to decompose a piece of music into its individual sound sources. As a special case, singing voice separation (SVS) splits music into vocals and accompaniment. Owing to its potential applications in melody extraction, music genre classification, singing voice detection, and singer identification, SVS has become an active topic in music information retrieval in recent years. A variety of convolutional neural network architectures based on U-Net have recently been reported to perform well on the SVS task. In addition, Wave-U-Net was proposed to achieve end-to-end SVS by operating directly on the music waveform. However, the performance of time-domain SVS approaches relies heavily on the quality of the feature extraction procedure. In this paper, the conventional Wave-U-Net based SVS scheme is modified to enhance its performance. First, in the encoding and decoding blocks, a residual unit is designed to replace the plain neural unit, which alleviates the degradation problem to some extent. Second, at the skip connections, an attention gate mechanism is introduced to reduce the semantic gap between the output of the previous layer in the decoding block and that of the corresponding layer in the encoding block. To verify the effectiveness of the proposed scheme, termed RA-WaveUNet, on the SVS task, its performance is compared with that of state-of-the-art schemes on the largest open dataset, MUSDB18. Experimental results demonstrate that the proposed scheme outperforms Wave-U-Net based schemes and other SVS schemes, and that both modifications contribute to the performance gain.
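
The two modifications named above can be illustrated with a minimal NumPy sketch. It is not the RA-WaveUNet implementation: the abstract does not specify layer counts, kernel sizes, or activations, so the residual unit below assumes a conv, activation, conv body with an identity shortcut, and the attention gate assumes the additive form commonly used on U-Net skip connections. All function names, shapes, and parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def conv1d_same(x, w, b):
    """1-D cross-correlation with zero 'same' padding (assumed layer type).

    x: (C_in, T), w: (C_out, C_in, K) with odd K, b: (C_out,) -> (C_out, T)
    """
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero-pad the time axis only
    t = x.shape[1]
    out = np.empty((c_out, t))
    for i in range(t):
        # Contract the kernel with a (C_in, K) window over channels and taps.
        out[:, i] = np.tensordot(w, xp[:, i:i + k], axes=([1, 2], [0, 1])) + b
    return out


def leaky_relu(x, slope=0.2):
    # Activation choice is an assumption, not taken from the abstract.
    return np.where(x > 0, x, slope * x)


def residual_unit(x, w1, b1, w2, b2):
    """Hypothetical residual unit: y = act(x + F(x)), F = conv -> act -> conv.

    The identity shortcut requires the output channel count to equal C_in.
    """
    h = leaky_relu(conv1d_same(x, w1, b1))
    h = conv1d_same(h, w2, b2)
    return leaky_relu(x + h)


def attention_gate(x, g, w_x, w_g, b, psi, b_psi):
    """Hypothetical additive attention gate on a skip connection.

    x: (C_x, T) encoder features, g: (C_g, T) decoder (gating) features,
    assumed to be already aligned in time.
    w_x: (C_int, C_x), w_g: (C_int, C_g), b: (C_int, 1), psi: (1, C_int).
    Returns the gated encoder features and the attention map alpha of shape (1, T).
    """
    q = np.maximum(w_x @ x + w_g @ g + b, 0.0)         # joint features, ReLU
    alpha = 1.0 / (1.0 + np.exp(-(psi @ q + b_psi)))   # sigmoid, values in (0, 1)
    return alpha * x, alpha


# Small random example; sizes are arbitrary.
rng = np.random.default_rng(0)
C, T, K, C_INT = 4, 16, 5, 3
x_enc = rng.standard_normal((C, T))   # encoder-side features
g_dec = rng.standard_normal((C, T))   # decoder-side features at the same scale

w1 = 0.1 * rng.standard_normal((C, C, K))
w2 = 0.1 * rng.standard_normal((C, C, K))
y_res = residual_unit(x_enc, w1, np.zeros(C), w2, np.zeros(C))

gated, alpha = attention_gate(
    x_enc, g_dec,
    rng.standard_normal((C_INT, C)), rng.standard_normal((C_INT, C)),
    np.zeros((C_INT, 1)), rng.standard_normal((1, C_INT)), 0.0,
)
```

In this sketch the gated encoder features would then be concatenated with the decoder features, as on an ordinary U-Net skip connection. The attention map scales each time step of the encoder output by a value between 0 and 1, which is one way to suppress encoder features that are inconsistent with the decoder state.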