Abstract:
With the exponential growth of data, automatic summarization offers a practical way to obtain useful information in a short time. A key problem is how to extract the essential information from long, redundant, unstructured text while keeping the result concise and fluent. The TextRank algorithm and improved variants such as SWTextRank have been widely used for extractive summarization, but they do not effectively solve the redundancy problem in extractive summaries. This paper therefore proposes PTextRank, an automatic extractive text summarization algorithm based on clause extraction. First, the text is preprocessed and split into sentences; each sentence is then annotated using Sinica Treebank (STB), and the extraction units are set at the clause level. Next, BERT is used to construct feature vectors for the title and for each clause, and the pairwise similarity between clause feature vectors is computed and stored in a similarity matrix. Finally, the clause similarity matrix is adjusted according to clause position and the similarity between each clause and the title, the scores are computed iteratively until convergence, and the highest-scoring clauses are selected as the final summary. Experiments and analysis show that, compared with traditional TextRank and the improved SWTextRank, PTextRank improves the accuracy of the generated summaries by at least 6% and yields summaries of better quality. Because PTextRank uses clauses as its extraction units, it works at a finer granularity than whole sentences and thereby avoids the redundant information that sentence-level extraction carries across multiple sentences.
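The ranking stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact adjustment formula, so the position weight and the title bonus below are assumptions chosen only for illustration, as are the function names `rank_clauses` and `select_summary`. The input vectors stand in for BERT embeddings of the clauses and the title; no BERT model is called here, and the iteration uses the normalized PageRank form of the TextRank update.

```python
import numpy as np


def cosine_matrix(vectors):
    """Pairwise cosine similarity between row vectors (rows must be non-zero)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)          # a clause does not vote for itself
    return np.clip(sim, 0.0, None)      # keep edge weights non-negative


def rank_clauses(clause_vecs, title_vec, damping=0.85, tol=1e-6, max_iter=200):
    """Score clauses with a TextRank-style iteration on an adjusted similarity matrix."""
    clause_vecs = np.asarray(clause_vecs, dtype=float)
    title_vec = np.asarray(title_vec, dtype=float)
    n = len(clause_vecs)
    sim = cosine_matrix(clause_vecs)

    # ASSUMPTION: earlier clauses receive a larger position weight.
    position = 1.0 + 1.0 / (1.0 + np.arange(n))

    # Similarity between each clause and the title.
    clause_unit = clause_vecs / np.linalg.norm(clause_vecs, axis=1, keepdims=True)
    title_unit = title_vec / np.linalg.norm(title_vec)
    title_sim = np.clip(clause_unit @ title_unit, 0.0, None)

    # ASSUMPTION: edges pointing to clause j are scaled by its combined bonus.
    bonus = position * (1.0 + title_sim)
    adjusted = sim * bonus[np.newaxis, :]

    # Row-normalise so each clause distributes its score over its neighbours.
    row_sums = adjusted.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0
    transition = adjusted / row_sums

    # Iterate until the scores stop changing.
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1.0 - damping) / n + damping * (transition.T @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break
    return scores


def select_summary(scores, k):
    """Indices of the k highest-scoring clauses, returned in document order."""
    top = np.argsort(scores)[::-1][:k]
    return sorted(int(i) for i in top)
```

In use, `rank_clauses` would receive one embedding per clause plus the title embedding, and `select_summary` would pick the top `k` clause indices, which are then joined in their original order to form the summary.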