Abstract:
As is well known, the singular value decomposition (SVD) is a useful tool for analyzing and processing high-dimensional vectors, and even matrices, in information retrieval and image processing, owing to its strong ability to reduce the dimensionality of the problem space. Latent semantic indexing (LSI) is a popular application of the SVD in information retrieval, which can overcome the two fundamental problems plaguing traditional lexical-matching indexing schemes: synonymy and polysemy. To cope with massive and rapidly growing data, several algorithms that incrementally update LSI on new documents have recently been developed. In this paper, we propose an improved LSI updating algorithm that achieves faster computation without degrading retrieval accuracy, compared with widely used LSI updating algorithms, e.g., Zha's method and its derivatives. There are two main time-consuming steps in such updating algorithms: one is computing the QR decomposition of the incremental left-vector matrix, and the other is computing the SVD of an intermediate matrix; it is noted that both steps involve iterations that are hard to parallelize. The proposed approach updates LSI by merging the partial singular value decomposition (PSVD) of the incremental documents into the main PSVD. Compared with Zha's method, the proposed algorithm performs the QR decomposition on a thinner matrix and the intermediate SVD on an s-bordered diagonal matrix. Moreover, the proposed algorithm can be parallelized by computing the PSVDs of multiple new document sets and merging them into the main PSVD. The theoretical justification for using the PSVD of the incremental documents in the updating process is provided in the supplement. Finally, experiments comparing the existing methods with the proposed LSI updating method are provided to illustrate its acceleration and accuracy. Furthermore, the proposed algorithm can even serve as a general-purpose algorithm for merging PSVDs.
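To make the merging idea concrete, the following is a minimal NumPy sketch of combining two PSVDs of column blocks A and B into a PSVD of [A B]. It is an illustration of the general merge pattern (project the new left vectors onto the existing subspace, take a thin QR of the residual, and apply an SVD to a small middle matrix), not the paper's exact algorithm; the function name `merge_psvd` and its signature are assumptions for this sketch.

```python
import numpy as np

def merge_psvd(U1, s1, V1t, U2, s2, V2t, k=None):
    """Merge A ~ U1 @ diag(s1) @ V1t and B ~ U2 @ diag(s2) @ V2t
    into a (rank-k truncated) PSVD of the concatenation [A B]."""
    # Project the new left singular vectors onto the existing subspace.
    P = U1.T @ U2
    # Thin QR of the component of U2 orthogonal to span(U1).
    Q, R = np.linalg.qr(U2 - U1 @ P)
    r1, r2 = len(s1), len(s2)
    # Small middle matrix: [[diag(s1), P @ diag(s2)], [0, R @ diag(s2)]].
    K = np.zeros((r1 + r2, r1 + r2))
    K[:r1, :r1] = np.diag(s1)
    K[:r1, r1:] = P * s2   # column-wise scaling = P @ diag(s2)
    K[r1:, r1:] = R * s2   # R @ diag(s2)
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    # Lift the small factors back to the full row and column spaces.
    U = np.hstack([U1, Q]) @ Uk
    n1, n2 = V1t.shape[1], V2t.shape[1]
    Vt = np.zeros((r1 + r2, n1 + n2))
    Vt[:r1, :n1] = V1t
    Vt[r1:, n1:] = V2t
    Vt = Vkt @ Vt
    if k is not None:  # optional truncation back to rank k
        U, sk, Vt = U[:, :k], sk[:k], Vt[:k]
    return U, sk, Vt
```

When the two input PSVDs are exact (full rank), the merged factors reproduce [A B] exactly; with truncated inputs the result is the usual best-effort approximation, which is what makes the scheme attractive for parallel, block-wise document updates.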