Abstract:
With the rapid growth of the packaging industry, the naming conventions of packaging products have become increasingly diverse, so named entity recognition (NER) of these products is becoming a necessity for packaging information extraction. Chinese product names tend to be long and compositionally complex, which makes them difficult to recognize in textual corpora. After analyzing current NER algorithms, this paper proposes an NER method for Chinese packaging products based on a bidirectional GRU-CRF model. The gated recurrent unit (GRU) is an improved structure for the hidden-layer nodes of a recurrent neural network (RNN). In the proposed model, a bidirectional GRU network stores and represents the contextual semantic information of each word, while a conditional random field (CRF) layer models the transition probabilities within the output word-label sequence. We gather textual documents such as news reports, announcements, and regulations from a packaging vertical website and train word vectors on them as pre-trained distributed representations of the domain glossary. After product entities in the text data are labeled automatically, word sequences are fed to the model as vectors, and the best label sequence is generated to mark the product entities in each sentence. Finally, the proposed model is compared on the Chinese packaging corpus with other classical models and state-of-the-art RNN models. The results show that the proposed model achieves a precision of 82.45% and a recall of 80.38%. A further series of comparative experiments on input vectors of different lengths, using both word-level and character-level representations, shows that the word-level representation fits the corpus and model used here better. In addition, the method requires less manual feature engineering than traditional machine learning models such as CRF and HMM.
Hence, the bidirectional GRU-CRF method with word-level distributed representation is better suited to the Chinese packaging product recognition task.
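
As an illustration of how a CRF layer can select the best label sequence from per-word scores, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the implementation used in this paper. The label set (`O`, `B-PRO`, `I-PRO`), the function name `viterbi_decode`, and all numeric scores are hypothetical; in a bidirectional GRU-CRF model the emission scores would come from the GRU outputs and the transition scores would be learned.

```python
from typing import List, Sequence, Tuple

# Hypothetical label set: outside, beginning of product name, inside product name.
LABELS = ["O", "B-PRO", "I-PRO"]

# Hypothetical emission scores, one row per word and one column per label.
# In a BiGRU-CRF these would be produced by the bidirectional GRU.
EMISSIONS = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.5],
    [0.3, 0.2, 1.8],
    [1.5, 0.1, 0.4],
]

# Hypothetical transition scores: TRANSITIONS[i][j] scores moving from label i to j.
# The O -> I-PRO transition is heavily penalized because it is not a valid sequence.
TRANSITIONS = [
    [0.5, 0.2, -10.0],
    [-0.5, -1.0, 1.0],
    [0.0, -0.5, 0.5],
]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> Tuple[List[int], float]:
    """Return the highest-scoring label index path and its total score."""
    n_labels = len(transitions)
    # score[j]: best score of any path that ends in label j at the current word.
    score = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at word t.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda j: score[j])
    best_score = score[last]
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    path, total = viterbi_decode(EMISSIONS, TRANSITIONS)
    print([LABELS[i] for i in path], total)
```

With these toy scores, choosing the highest-scoring label for each word independently would give `O, I-PRO, I-PRO, O`, which starts an entity with an inside tag. The decoded sequence is `O, B-PRO, I-PRO, O`, because the transition scores penalize the invalid `O` to `I-PRO` step.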