  • ISSN 1006-3080
  • CN 31-1691/TQ

A Short Text Clustering Method Combining Topic Model and Paragraph Vector

RAO Yuhe, LING Zhihao

Citation: RAO Yuhe, LING Zhihao. A Short Text Clustering Method Combining Topic Model and Paragraph Vector[J]. Journal of East China University of Science and Technology, 2020, 46(3): 419-427. doi: 10.14135/j.cnki.1006-3080.20190430001

doi: 10.14135/j.cnki.1006-3080.20190430001
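Details
    About the author:

    RAO Yuhe (b. 1995), male, master's student. His main research interest is pattern recognition. E-mail: auto_ryh@foxmail.com

    Corresponding author:

    LING Zhihao, E-mail: zhhling@ecust.edu.cn

  • CLC number: TP391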

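  • Abstract: To overcome the sparsity and high dimensionality of short texts and to improve clustering quality, this paper proposes a short text clustering method that combines the Biterm Topic Model (BTM) with Paragraph Vector (PV). The method has two main steps. First, it uses the word-document-topic probability distribution estimated by the BTM, together with the local outlier factor (LOF) and the Jensen-Shannon (JS) divergence, to split the words of the whole corpus by sense. Second, it feeds the sense-split texts into the vectorization model PV-DBOW (Distributed Bag of Words version of Paragraph Vector) to obtain paragraph vectors, and concatenates each paragraph vector with the corresponding document-topic probability distribution to form the text feature vector. Experimental results show that the resulting feature vectors discriminate well between short texts, effectively improve short text clustering, and avoid the effects of short-text sparsity.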
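
The two building blocks named in the abstract can be illustrated with a minimal sketch. It shows the standard JS divergence between two discrete topic distributions and the concatenation of a paragraph vector with a document-topic distribution. The abstract does not say exactly how the JS divergence and LOF are combined during word splitting, so that step is not reproduced here; the function names and all numeric values are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical distributions, 1 for disjoint supports.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def build_feature(paragraph_vector, doc_topic_dist):
    """Concatenate a paragraph vector with a document-topic distribution."""
    return list(paragraph_vector) + list(doc_topic_dist)


# Hypothetical topic distributions of one word in two documents (3 topics).
topics_in_doc_a = [0.8, 0.1, 0.1]
topics_in_doc_b = [0.1, 0.1, 0.8]
print(js_divergence(topics_in_doc_a, topics_in_doc_b))

# Hypothetical 4-dimensional paragraph vector plus a 3-topic distribution.
feature = build_feature([0.12, -0.40, 0.33, 0.05], topics_in_doc_a)
print(len(feature))
```

A large divergence between the topic distributions of the same word in different documents suggests that the word is used in different senses, which is the kind of signal the splitting step relies on. The concatenated vector has as many dimensions as the paragraph vector plus the number of topics.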

     

  • Figure 1. Process of short text clustering

    Figure 2. Probabilistic graphical model of BTM

    Figure 3. Relation between clustering results and the number of randomly selected words to be split

    Figure 4. Precision comparison between DT2Vec (the proposed method) and Doc2Vec (Sogou corpus)

    Figure 5. Recall comparison between DT2Vec (the proposed method) and Doc2Vec (Sogou corpus)

    Table 1. Comparison of clustering results on the Sogou corpus

    Method     V-measure/%   Macro-F1/%   Micro-F1/%
    WM2Vec     61.973        77.417       77.565
    SIF2Vec    61.706        76.836       77.266
    Doc2Vec    67.689        83.238       83.397
    BTM2Vec    64.633        73.895       74.707
    DT2Vec     71.078        84.992       85.091

    Table 2. Comparison of clustering results on the Fudan corpus

    Method     V-measure/%   Macro-F1/%   Micro-F1/%
    WM2Vec     52.311        69.882       69.748
    SIF2Vec    40.556        59.501       57.863
    Doc2Vec    59.667        82.091       82.207
    BTM2Vec    61.755        81.004       81.568
    DT2Vec     64.711        83.857       84.152
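
The metrics reported in Tables 1 and 2 can be illustrated with a minimal sketch. It assumes the standard definition of V-measure (harmonic mean of homogeneity and completeness, with β = 1) and the usual macro- and micro-averaged F1 for single-label data. This page does not state which variants the authors used or how clusters were mapped to classes for F1, so the code assumes cluster labels have already been mapped to class labels. The label data is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def f1_scores(true_labels, pred_labels):
    """Return (macro_f1, micro_f1); assumes predictions use the class label set."""
    pairs = list(zip(true_labels, pred_labels))
    tp_all = fp_all = fn_all = 0
    per_class = []
    for cls in sorted(set(true_labels) | set(pred_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro_denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / micro_denom if micro_denom else 0.0
    return sum(per_class) / len(per_class), micro


# Hypothetical ground-truth classes and mapped cluster assignments.
truth = ["sports", "sports", "sports", "finance", "finance", "travel"]
pred = ["sports", "sports", "finance", "finance", "finance", "finance"]
print(v_measure(truth, pred))
print(f1_scores(truth, pred))
```

V-measure does not depend on how cluster IDs are named, so it needs no mapping step. Macro-F1 averages the per-class scores and is therefore sensitive to small classes, while micro-F1 pools the counts over all classes.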
Publication history
  • Received:  2019-04-30
  • Published online:  2019-06-27
  • Issue date:  2020-06-01
