An Automatic Text Summarization Algorithm Based on Clause Extraction

Zhu Bingbing; Luo Fei; Luo Yongjun; Ding Weichao; Huang Hao

doi:10.14135/j.cnki.1006-3080.20221122001

Abstract

With information growing exponentially, automatic summarization helps readers obtain useful content in a short time. A key problem is how to extract the essential information from long, redundant, unstructured text while keeping the result concise and fluent. The TextRank algorithm and improved variants such as SWTextRank are widely used in extractive summarization, but they do not effectively address the redundancy problem of extractive summaries. This paper therefore proposes PTextRank, an automatic text summarization algorithm based on clause extraction. First, the text is preprocessed and split into sentences, each sentence is syntactically annotated using Sinica Treebank (STB), and extraction units are set at the clause level. Next, BERT (Bidirectional Encoder Representation from Transformers) is used to build feature vectors for the title and for each clause, and the similarities between clause feature vectors are computed and stored in a similarity matrix. Finally, the clause similarity matrix is adjusted using clause position and the similarity between each clause and the title, scores are computed iteratively until convergence, and the highest-scoring clauses are selected as the final summary. Experimental analysis shows that, by using clauses as finer-grained extraction units, PTextRank effectively avoids the redundant information spread across multiple sentences. Compared with the traditional TextRank and the improved SWTextRank, the accuracy of the summaries generated by PTextRank is at least 6% higher, and the generated summaries are of better quality.
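The ranking stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact adjustment formula, so the way position and title similarity enter the edge weights here (`title_weight`, `position_weight`, and the `boost` term) is an assumption made for illustration, as are the function names. Clause and title vectors are assumed to be precomputed (for example by a BERT encoder) and are passed in as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_clauses(clause_vecs, title_vec, damping=0.85,
                 title_weight=0.5, position_weight=0.2,
                 tol=1e-6, max_iter=200):
    """TextRank-style scores over clauses (hypothetical weighting scheme)."""
    n = len(clause_vecs)
    # Pairwise clause similarity matrix; no self-loops, negatives clamped to 0.
    sim = [[max(cosine(clause_vecs[i], clause_vecs[j]), 0.0) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    # Assumed adjustment: favour clauses close to the title and early clauses.
    boost = [1.0
             + title_weight * max(cosine(v, title_vec), 0.0)
             + position_weight * (1.0 - i / n)
             for i, v in enumerate(clause_vecs)]
    # Edge i -> j is scaled by the boost of the receiving clause j.
    w = [[sim[i][j] * boost[j] for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]

    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * w[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1.0 - damping) / n + damping * incoming)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:  # stop once the scores have converged
            break
    return scores


def top_clauses(scores, k):
    """Indices of the k highest-scoring clauses, returned in document order."""
    best = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(best)


# Toy usage with 2-dimensional stand-ins for real clause embeddings.
toy_clauses = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_title = [1.0, 0.0]
toy_scores = rank_clauses(toy_clauses, toy_title)
summary_indices = top_clauses(toy_scores, 2)
```

Returning the selected indices in document order keeps the extracted clauses readable as a summary. A real system would also need the clause segmentation and embedding steps, which this sketch leaves out.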
