Abstract:
To screen anticancer drugs accurately and quickly using effective structural features and fewer descriptors, this work applies correlation-based feature selection to enhance the ability of molecular fingerprints and descriptors to describe the structure of anticancer drugs. Drug information is collected from DrugBank and PubChem, two databases of chemical compounds, with each antitumor drug labeled as 1 and each non-antitumor drug labeled as 0. After cleaning, the resulting unbalanced dataset contains 200 antitumor medicines and 10940 non-antitumor medicines. Weighting coefficients and under-sampling are each used to handle the class imbalance, yielding two different balanced datasets. RDKit molecular descriptors, MACCS fingerprints, and Mordred molecular descriptors are calculated to describe the structural information of the medicines. Correlation-based feature selection is then employed to reduce the redundancy among these molecular fingerprints and descriptors. In combination with a decision tree, the Pearson correlation coefficient and the chi-squared (χ²) test are used as feature selection criteria to simplify the above molecular descriptors and to select the combination of structural features with the most satisfactory screening performance. According to these results, feature selection enhances the ability to identify antitumor drugs. Furthermore, a combination of 10 selected MACCS fingerprint features shows the best performance, with about 81% of the antitumor medicines identified. The structural feature combination most associated with anticancer effect is thereby selected. In conclusion, the above feature selection methods can effectively simplify molecular fingerprints and descriptors and improve the screening of antitumor drugs.
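
As an illustration of the chi-squared feature selection step described above, the following is a minimal sketch in plain Python. It assumes binary fingerprint bits and binary labels (1 for antitumor, 0 for non-antitumor) and scores each bit with the chi-squared statistic of its 2x2 contingency table against the label. The function names `chi2_score` and `select_top_k` are hypothetical, the toy matrix is made up purely for demonstration and is not data from this study, and the exact chi-squared variant used in the study may differ.

```python
from typing import List, Sequence


def chi2_score(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Chi-squared statistic of the 2x2 table: binary feature vs. binary label."""
    n = len(labels)
    # Observed counts of the four (bit, label) combinations.
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant feature (or a single-class label) carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(X: List[List[int]], y: Sequence[int], k: int) -> List[int]:
    """Return the column indices of the k features with the highest score."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy example (illustrative only): 6 molecules, 3 fingerprint bits.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
y = [1, 1, 1, 0, 0, 0]

selected = select_top_k(X, y, k=1)
```

In this toy case the first bit separates the two classes perfectly, so it receives the highest score and is the one retained. In a workflow like the one described above, the retained columns would then be passed to a classifier such as a decision tree.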