  • ISSN 1006-3080
  • CN 31-1691/TQ

An End-to-End Singing Voice Separation Model Based on Residual Attention U-Net

WANG Bin, CHEN Ning

Citation: WANG Bin, CHEN Ning. An End-to-End Singing Voice Separation Model Based on Residual Attention U-Net[J]. Journal of East China University of Science and Technology, 2021, 47(5): 619-626. doi: 10.14135/j.cnki.1006-3080.20200903001

doi: 10.14135/j.cnki.1006-3080.20200903001
Funding: General Program of the National Natural Science Foundation of China (61771196)
Details
    About the author:

    WANG Bin (1996—), female, from Anhui Province, master's student; her research focuses on music source separation and audio signal processing. E-mail: y45180173@mail.ecust.edu.cn

    Corresponding author:

    CHEN Ning, E-mail: chenning_750210@163.com

  • CLC number: TP391

  • Abstract: Singing voice separation is one of the most challenging tasks in music information retrieval. This paper improves the Wave-U-Net-based singing voice separation model to enhance its performance. First, residual units are designed and introduced into the encoding and decoding blocks of Wave-U-Net to strengthen feature extraction and improve training efficiency. Then, an attention gate mechanism is designed and introduced into the skip connections of Wave-U-Net to narrow the semantic gap between the features extracted from the corresponding encoding layer and those from the previous decoding layer. Experimental results on the MUSDB18 dataset show that the proposed RA-WaveUNet model outperforms the traditional Wave-U-Net in separation performance, and that both the residual units and the attention gate mechanism contribute to the improvement.
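
To make the two modifications concrete, below is a minimal PyTorch sketch of a residual unit for the encoding/decoding blocks and an additive attention gate on a skip connection, in the spirit of He et al. [15] and Schlemper et al. [18]. This is not the authors' implementation; the module names (ResidualUnit1d, AttentionGate1d) and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualUnit1d(nn.Module):
    """Conv block with an identity shortcut (cf. He et al. [15])."""
    def __init__(self, channels, kernel_size=15):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.LeakyReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )
        self.act = nn.LeakyReLU()

    def forward(self, x):
        # The shortcut lets gradients bypass the conv stack, easing training.
        return self.act(x + self.body(x))

class AttentionGate1d(nn.Module):
    """Additive attention gate on a skip connection (cf. Schlemper et al. [18])."""
    def __init__(self, enc_ch, dec_ch, inter_ch):
        super().__init__()
        self.w_enc = nn.Conv1d(enc_ch, inter_ch, 1)  # projects encoder skip features
        self.w_dec = nn.Conv1d(dec_ch, inter_ch, 1)  # projects decoder gating signal
        self.psi = nn.Conv1d(inter_ch, 1, 1)         # one attention score per time step

    def forward(self, enc_feat, dec_feat):
        # enc_feat and dec_feat must share the same time length here
        # (the decoder features are upsampled before gating).
        a = torch.sigmoid(self.psi(torch.relu(
            self.w_enc(enc_feat) + self.w_dec(dec_feat))))
        return enc_feat * a  # suppress irrelevant skip features before Concat
```

For example, with the shapes from Table 1, `AttentionGate1d(240, 264, 64)` would re-weight a 240-channel encoder skip using a 264-channel decoder gating signal before the Concat step.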

     

  • Figure 1. Block diagram of RA-WaveUNet model

    Figure 2. Comparison between the plain neural unit and three different kinds of residual units

    Figure 3. Architecture of the attention gate

    Figure 4. Comparison of the numbers of training parameters of M4-R3 models with different numbers of layers

    Figure 5. Performance of M4-R3 models with different numbers of layers

    Table 1. Architecture details of RA-WaveUNet model

    Block                               Operation                           Output shape
    Input                                                                   16384×2
    Encoding block $i$,                 E-Residual unit $i$                 16×240
    $i = 1,\cdots,10$                   Decimation
    Bridge block                        Residual unit 11                    16×264
    Decoding block $i$,                 Linear interpolation                32×264
    $i = 10,\cdots,1$                   Concat(Att(E-Residual unit $i$))    32×504
                                        D-Residual unit $i$
    Output                                                                  16384×2
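
As a quick sanity check on the shapes in Table 1, a few lines of Python (illustrative only; it assumes Wave-U-Net's convention [12] that encoding block $i$ widens the signal to 24·i channels and decimates the time axis by 2, which is consistent with the 16×240 and 16×264 entries above):

```python
# Walk the tensor shape through the encoder of Table 1 (illustrative;
# assumes 24*i channels per block and decimation by 2, as in Wave-U-Net [12]).
time_steps, channels = 16384, 2        # stereo waveform input
for i in range(1, 11):                 # 10 encoding blocks
    channels = 24 * i                  # E-Residual unit i widens to 24*i channels
    time_steps //= 2                   # Decimation halves the time axis
print(time_steps, channels)            # -> 16 240, the input to the bridge block
# Bridge: Residual unit 11 gives 16×264; the first decoding block upsamples
# to 32×264, and Concat(Att(...)) appends the 240-channel skip: 264+240 = 504.
```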

    Table 2. Performance comparison of Wave-U-Net with different types of residual units

    Schemes      Vocals                               Accompaniment
                 Med./dB   MAD/dB   Mean/dB   SD/dB   Med./dB   MAD/dB   Mean/dB   SD/dB
    M4           4.46      3.21     0.65      13.67   10.69     3.15     11.85     7.03
    M4-R1        4.63      3.30     1.13      13.11   10.73     3.10     12.21     7.09
    M4-R2        4.49      3.15     0.34      14.05   10.47     3.01     11.72     6.77
    M4-R3        5.04      3.34     1.43      13.27   10.93     3.09     12.40     6.90
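
The column headers in Tables 2 to 5 are the usual summary statistics of SDR [22]: median (Med.), median absolute deviation (MAD), mean, and standard deviation (SD), presumably computed over evaluation segments. A minimal sketch, assuming a list of per-segment SDR values in dB is already available:

```python
import numpy as np

def sdr_stats(sdr):
    """Med./MAD/Mean/SD as reported in Tables 2-5, over per-segment SDR values."""
    sdr = np.asarray(sdr, dtype=float)
    med = np.median(sdr)
    mad = np.median(np.abs(sdr - med))   # median absolute deviation
    return med, mad, sdr.mean(), sdr.std()
```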

    Table 3. Influence of the BN layer on the separation performance

    Schemes      Vocals                               Accompaniment
                 Med./dB   MAD/dB   Mean/dB   SD/dB   Med./dB   MAD/dB   Mean/dB   SD/dB
    M4           4.46      3.21     0.65      13.67   10.69     3.15     11.85     7.03
    M4-R1        4.63      3.30     1.13      13.11   10.73     3.10     12.21     7.09
    M4-R1+BN     4.50      3.22     0         14.84   10.56     2.98     11.46     7.16
    M4-R2        4.49      3.15     0.34      14.05   10.47     3.01     11.72     6.77
    M4-R2+BN     4.38      3.23     −0.54     15.57   10.38     2.93     11.18     6.53
    M4-R3        5.04      3.34     1.43      13.27   10.93     3.09     12.40     6.90
    M4-R3+BN     4.79      3.31     0.28      14.84   10.85     3.07     11.81     6.64

    Table 4. Contribution of the attention gate to the performance

    Schemes      Vocals                               Accompaniment
                 Med./dB   MAD/dB   Mean/dB   SD/dB   Med./dB   MAD/dB   Mean/dB   SD/dB
    M4           4.46      3.21     0.65      13.67   10.69     3.15     11.85     7.03
    M4-A         4.52      3.27     0.91      13.29   10.72     3.09     12.03     6.98
    M4-R3-10     4.89      3.33     1.28      13.27   10.93     3.09     12.28     6.84
    RA-WaveUNet  4.99      3.28     1.54      13.09   10.97     3.09     12.38     6.96

    Table 5. Performance comparison with state-of-the-art SVS models

    Schemes          Vocals                               Accompaniment
                     Med./dB   MAD/dB   Mean/dB   SD/dB   Med./dB   MAD/dB   Mean/dB   SD/dB
    M4               4.46      3.21     0.65      13.67   10.69     3.15     11.85     7.03
    MHE0[13]         4.69      3.24     0.75      13.91   10.88     3.13     12.10     6.77
    HydraNet+H7[14]                     1.66      4.75                       10.71     2.90
    U310[17]         4.84      3.33     1.09      13.57   10.91     3.14     12.26     6.84
    RA-WaveUNet      4.99      3.28     1.54      13.09   10.97     3.09     12.38     6.96
  • [1] LI Y P, WANG D L. Separation of singing voice from music accompaniment for monaural recordings[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(4): 1475-1487. doi: 10.1109/TASL.2006.889789
    [2] SALAMON J, GOMEZ E, ELLIS D, et al. Melody extraction from polyphonic music signals: Approaches, applications, and challenges[J]. IEEE Signal Processing Magazine, 2014, 31(2): 118-134. doi: 10.1109/MSP.2013.2271648
    [3] KUM S, NAM J. Joint detection and classification of singing voice melody using convolutional recurrent neural networks[J]. Applied Sciences, 2019, 9(7): 1324-1341. doi: 10.3390/app9071324
    [4] YOU S D, LIU C H, CHEN W K. Comparative study of singing voice detection based on deep neural networks and ensemble learning[J]. Human-Centric Computing and Information Sciences, 2018, 8(1): 34-50. doi: 10.1186/s13673-018-0158-1
    [5] SHARMA B, DAS R K, LI H Z. On the importance of audio-source separation for singer identification in polyphonic music[C]//Conference of the International Speech Communication Association (INTERSPEECH). Graz, Austria: IEEE, 2019: 2020-2024.
    [6] SPRECHMANN P, BRONSTEIN A M, SAPIRO G. Real-time online singing voice separation from monaural recordings using robust low-rank modeling[C]//International Society for Music Information Retrieval (ISMIR). Porto, Portugal: INESC TEC, 2012: 67-72.
    [7] IKEMIYA Y, YOSHII K, ITOYAMA K. Singing voice analysis and editing based on mutually dependent F0 estimation and source separation[C]//2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brisbane, Australia: IEEE, 2015: 574-578.
    [8] ZHAO T K. Music information retrieval based on deep neural networks[D]. Beijing: Beijing University of Posts and Telecommunications, 2015.
    [9] SIMPSON A J R, ROMA G, PLUMBLEY M D. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network[C]// 12th International Conference on Latent Variable Analysis and Signal Separation (LVA). Czech Republic: Springer, 2015: 429-436.
    [10] RONNEBERGER O, FISCHER P, BROX T. U-net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Munich, Germany: Springer, 2015: 234-241.
    [11] JANSSON A, HUMPHREY E J, MONTECCHIO N, et al. Singing voice separation with deep U-Net convolutional networks[C]//Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou, China: National University of Singapore, 2017: 323-332.
    [12] STOLLER D, EWERT S, DIXON S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation[C]//International Society for Music Information Retrieval (ISMIR). Paris, France: Télécom ParisTech and IRCAM, 2018: 334-340.
    [13] PEREZ-LAPILLO J, GALKIN O, WEYDE T. Improving singing voice separation with the Wave-U-Net using minimum hyperspherical energy[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Virtual Barcelona: IEEE, 2020: 3272-3276.
    [14] KASPERSEN E T, KOUNALAKIS T, ERKUT C. Hydranet: A real-time waveform separation network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Virtual Barcelona: IEEE, 2020: 4327-4331.
    [15] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016: 770-778.
    [16] IBTEHAZ N, RAHMAN M S. Multiresunet: Rethinking the U-net architecture for multimodal biomedical image segmentation[J]. Neural Networks, 2020, 121(1): 74-81.
    [17] CHEN B W, HSU Y M, LEE H Y. J-Net: Randomly weighted U-Net for audio source separation[EB/OL]. arXiv.org, (2019-11-29)[2020-08-30]. https://arxiv.org/pdf/1911.12926v1.pdf.
    [18] SCHLEMPER J, OKTAY O, SCHAAP M, et al. Attention gated networks: Learning to leverage salient regions in medical images[J]. Medical Image Analysis, 2019, 53(1): 197-207.
    [19] JETLEY S, LORD N A, LEE N, et al. Learn to pay attention[C]//Proceedings of the International Conference on Learning Representations (ICLR). Vancouver, Canada, 2018: 1-14.
    [20] RAFII Z, LIUTKUS A, STÖTER F R, et al. MUSDB18: A corpus for music separation[EB/OL]. (2019-06-30)[2020-08-30]. https://hal.inria.fr/hal-02190845/document.
    [21] LIUTKUS A, FITZGERALD D, RAFII Z. Scalable audio separation with light kernel additive modelling[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brisbane, Australia: IEEE, 2015: 76-80.
    [22] VINCENT E, GRIBONVAL R, FEVOTTE C. Performance measurement in blind audio source separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1462-1469. doi: 10.1109/TSA.2005.858005
    [23] KINGMA D P, BA J. Adam: A method for stochastic optimization[C]//The 3rd International Conference for Learning Representations (ICLR). San Diego, USA: IEEE, 2015: 1-15.
Publication history
  • Received: 2020-09-03
  • Published online: 2020-12-16
  • Published in issue: 2021-10-11
