
    Transcription Factor Binding Site Prediction Model Based on MCSP and Swin Transformer

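    • Abstract: Predicting transcription factor binding sites (TFBS) helps identify the regulatory mechanisms specific to particular cells and tissues and is essential for understanding how gene expression is regulated. Existing methods predict TFBS by combining DNA sequence and shape information, but the shape information they generate does not account for long flanking nucleotides, and their sequence feature extraction ignores the complementarity of features across channels, so their predictive ability leaves room for improvement. This paper proposes MSSW, a TFBS prediction model that takes long flanking nucleotides into account when generating shape information. It uses the Swin Transformer to extract the combined long-range dependencies and local correlations in the shape information, and it integrates split attention into a multi-scale convolutional neural network (Multi-scale Convolution and Split attention, MCSP) to capture the complementarity of features across channels in the sequence, yielding cross-channel multi-scale sequence features. The extracted high-level sequence and shape features are then combined to predict TFBS. The results show that MSSW outperforms existing TFBS prediction models and predicts TFBS effectively.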

       

      Abstract: Predicting transcription factor binding sites (TFBS) helps identify the regulatory mechanisms specific to particular cells and tissues, which is crucial for understanding the regulation of gene expression. Existing methods combine DNA sequence and shape information for TFBS prediction, but they typically consider only neighboring nucleotides when generating shape information, neglecting the influence of longer flanking nucleotides. In the sequence branch, these methods neglect the complementarity of features across channels; in the shape branch, they do not adequately capture the local correlations and long-range dependencies of the shape information. This shallow use of both sequence and shape information limits prediction performance. To address these issues, this paper proposes a novel TFBS prediction model, MSSW. First, Deep DNAshape is used to generate shape information that accounts for long flanking nucleotides, giving the shape branch a more comprehensive set of shape data. Second, the Swin Transformer extracts features from the shape information, capturing local correlations through window-based self-attention and long-range dependencies through window shifting. Third, multi-scale convolution with split attention (MCSP) extracts multi-scale cross-channel features from the sequence. The sequence and shape features are then fused to predict TFBS. MSSW is evaluated on 165 ChIP-seq datasets. The experimental results show that it outperforms existing TFBS prediction models, and ablation studies validate the effectiveness of MCSP and the Swin Transformer. The model's generalization is also verified across different cell lines, providing valuable insights for predicting TFBS in various cellular contexts. The proposed model achieves strong predictive performance across datasets of different scales and performs particularly well on small and medium-sized datasets.
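
The idea of multi-scale convolution with split attention on the sequence branch can be illustrated with a small, dependency-free sketch. This is not the MSSW implementation: the kernel sizes (3, 5, 7), the channel count, the random weights, and all function names are hypothetical, and the attention logits are taken directly from globally pooled activations instead of the learned dense layers a real split-attention block would use. It only shows how parallel convolutions at different scales might be re-weighted per channel by a softmax across branches.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as L rows of 4 values (A, C, G, T); unknown bases such as N become all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def conv1d_same(x, kernel):
    """Zero-padded 'same' 1D cross-correlation of x (L x 4) with one kernel (k x 4), followed by ReLU."""
    k, length = len(kernel), len(x)
    pad = k // 2
    out = []
    for i in range(length):
        s = 0.0
        for j in range(k):
            pos = i + j - pad
            if 0 <= pos < length:
                s += sum(w * v for w, v in zip(kernel[j], x[pos]))
        out.append(max(s, 0.0))
    return out

def multi_scale_split_attention(x, branches):
    """Fuse several convolution branches with per-channel softmax weights.

    branches: list of branches; each branch is a list of C kernels (same C for all).
    Returns (fused, weights): fused is L x C, weights is C x n_branches.
    """
    length, n_channels, n_branches = len(x), len(branches[0]), len(branches)
    # feats[b][c] holds the length-L activation of channel c in branch b.
    feats = [[conv1d_same(x, kernel) for kernel in branch] for branch in branches]
    # Simplification (assumption): global average pooling is used directly as the attention logit.
    pooled = [[sum(ch) / length for ch in branch_feats] for branch_feats in feats]
    weights = []
    for c in range(n_channels):
        logits = [pooled[b][c] for b in range(n_branches)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])  # softmax across branches
    fused = [
        [sum(weights[c][b] * feats[b][c][i] for b in range(n_branches)) for c in range(n_channels)]
        for i in range(length)
    ]
    return fused, weights

def random_branch(kernel_size, n_channels, rng):
    """Hypothetical stand-in for learned weights: n_channels random kernels of shape kernel_size x 4."""
    return [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(kernel_size)] for _ in range(n_channels)]

rng = random.Random(0)
x = one_hot("ACGTTGCANACGT")
branches = [random_branch(k, 8, rng) for k in (3, 5, 7)]
fused, weights = multi_scale_split_attention(x, branches)
print(len(fused), len(fused[0]))  # 13 8
```

Each channel of the fused output is a convex combination of the same channel across the three kernel sizes, which is one simple way to let features at different scales complement one another.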

       
