Abstract:
With the continuous accumulation of text documents, text classification has received increasing attention, since it can automatically assign a correct category label to an input text document. Feature selection is an important step in text classification; its goal is to choose highly discriminative features that improve classifier performance. In this paper, we investigate filter-based feature selection, which ranks features by a feature selection metric and selects features according to the ranking. The traditional document frequency (DF) is a common statistical feature selection metric: the number of documents containing a feature is taken as the basis of selection, and a feature that appears in most documents is considered important. However, this can lead to the selection of features that carry little category information, because the correlation between features and categories is ignored. To address this shortcoming, this paper proposes an improved feature ranking metric, termed normalized document frequency (NDF). To take the correlation between features and categories into account, NDF introduces two normalization factors: the number of documents based on category and the number of documents based on feature. The performance of NDF is compared with three well-known feature ranking metrics, DF, odds ratio (OR), and chi-squared (CHI), on a news data set at different feature dimensions using a naive Bayes (NB) classifier. The results show that the NDF metric outperforms the three metrics in terms of Macro-F1 and accuracy (ACC), with increases of 3.5%, 2.0%, and 3.4%, respectively. Therefore, the NDF metric can select valuable features that are more favorable for distinguishing text categories and can effectively improve the performance of text classification.
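As an illustration of the DF baseline described above (not of the proposed NDF metric, whose formula is not given in this abstract), the following minimal Python sketch ranks terms by the number of documents that contain them and keeps the top k. The function names `document_frequency` and `select_top_k_by_df`, the tie-breaking rule, and the three-document toy corpus are assumptions made for this example only.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document,
        # no matter how often it occurs inside that document.
        df.update(set(tokens))
    return df


def select_top_k_by_df(docs, k):
    """Filter-style selection: rank terms by DF and keep the k highest."""
    df = document_frequency(docs)
    # Sort by descending DF; ties are broken alphabetically (an assumption
    # of this sketch, chosen only to make the output deterministic).
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["the", "team", "won", "the", "match"],
    ["the", "market", "fell"],
    ["the", "team", "lost"],
]

print(select_top_k_by_df(docs, 2))  # ['the', 'team']
```

In this toy corpus the top-ranked term is "the", which occurs in every document and therefore says little about any category. This is the kind of selection the abstract attributes to DF ignoring the correlation between features and categories.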