    Citation: ZHAO Hongshan, FAN Guisheng, YU Huiqun. Text Classification Method Based on Normalized Document Frequency Feature Selection[J]. Journal of East China University of Science and Technology, 2019, 45(5): 809-814. DOI: 10.14135/j.cnki.1006-3080.20180914005

    Text Classification Method Based on Normalized Document Frequency Feature Selection

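    • Abstract: Feature selection is an important step in text classification and plays a significant role in improving classification performance. The traditional document frequency (DF) feature selection metric only counts, from a global perspective, the number of documents that contain a feature and uses that count as the basis for selection; it does not consider the correlation between features and categories. To address this problem, this paper starts from the correlation between features and categories, normalizes document frequency both locally and globally, and proposes a normalized document frequency (NDF) feature selection metric. The effect of feature selection on text classification performance is then verified at different feature dimensions. The results show that applying the NDF metric yields higher classification accuracy and Macro-F1. Normalizing document frequency therefore selects more valuable features and effectively improves text classification performance.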

       

      Abstract: With the continuing accumulation of text documents, text classification has received increasing attention, since it can automatically assign the correct category label to an input document. Feature selection is an important step in text classification; its goal is to choose highly discriminative features that improve classifier performance. This paper investigates filter-based feature selection, in which features are ranked by a feature selection metric and selected according to that ranking. Traditional document frequency (DF) is a common statistical feature selection metric: the number of documents containing a feature is taken as the basis for selection, and a feature that appears in most documents is considered important. However, this can select features that carry little category information, because the correlation between features and categories is ignored. To address this shortcoming, this paper proposes an improved feature ranking metric, termed normalized document frequency (NDF). Taking the correlation between features and categories into account, it introduces two normalization factors: the number of documents based on category and the number of documents based on feature. The performance of NDF is compared with three well-known feature ranking metrics, DF, odds ratio (OR), and chi-squared (CHI), on a news data set at different feature dimensions using a naive Bayes (NB) classifier. The results show that NDF outperforms the three metrics in terms of Macro-F1 and accuracy (ACC), with increases of 3.5%, 2.0%, and 3.4%, respectively. The NDF metric can therefore select valuable features that better distinguish text categories and effectively improve text classification performance.
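The contrast the abstract draws between plain DF ranking and a category-aware, normalized ranking can be illustrated with a small sketch. The abstract does not give the NDF formula, so the score below is only an assumption made for illustration: it multiplies a per-category normalization (documents in the category containing the term, divided by category size) by a per-feature normalization (the same count divided by the term's total document frequency). The function names, the toy corpus, and the use of the maximum over categories are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def document_frequency(docs):
    """Plain DF: for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def normalized_df_scores(docs, labels):
    """Illustrative category-aware score (an assumption, NOT the paper's NDF)."""
    class_sizes = Counter(labels)
    df_total = document_frequency(docs)
    df_by_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        df_by_class[label].update(set(tokens))

    scores = {}
    for term, total in df_total.items():
        # per-category factor * per-feature factor, best category wins
        scores[term] = max(
            (df_by_class[c][term] / class_sizes[c]) * (df_by_class[c][term] / total)
            for c in class_sizes
        )
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: two sport documents and two finance documents.
docs = [
    ["the", "match", "goal"],
    ["the", "team", "goal"],
    ["the", "stock", "market"],
    ["the", "market", "bank"],
]
labels = ["sport", "sport", "finance", "finance"]

print(top_k(document_frequency(docs), 1))            # plain DF favours "the"
print(top_k(normalized_df_scores(docs, labels), 2))  # category-specific terms
```

On this toy corpus, plain DF ranks the uninformative term "the" first because it occurs in every document, while the illustrative normalized score ranks "goal" and "market" first because each is concentrated in a single category. This mirrors the shortcoming of DF described in the abstract, but it says nothing about how the published NDF metric behaves.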

       
