Journal of Beijing University of Posts and Telecommunications

  • EI Core Journal

JOURNAL OF BEIJING UNIVERSITY OF POSTS AND TELECOM ›› 2009, Vol. 32 ›› Issue (5): 10-14. doi: 10.13190/jbupt.200905.10.wenj

• Papers

Chinese Frequent String Extraction and Application on Language Model

WEN Juan, WANG Xiao-jie

  1. (Research Center of Intelligence Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China)
  • Received: 2009-02-15 Revised: 2009-06-30 Online: 2009-10-28 Published: 2009-10-28
  • Contact: WEN Juan

Abstract:

To extract Chinese frequent strings (CFS) accurately and make better use of them in language models, a new CFS extraction method based on string segmentation degree is proposed. Unigram and bigram language models are built on the strings extracted by this method. Experiments show that the CFS-based language model mitigates the lack of long-distance dependency in character-based and word-based language models. They also show that it achieves lower perplexity and higher pinyin-to-character conversion accuracy than a model built on the previous CFS extraction method.

Key words: Chinese frequent string, character distinction degree, string segmentation degree, n-gram language model, pinyin-to-character conversion
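
The abstract compares language models by perplexity. As a rough illustration of that metric only, the following minimal sketch trains a bigram model on pre-segmented token sequences and computes its perplexity. It is not the authors' method: the add-one smoothing, the boundary markers `<s>` and `</s>`, and the function names `train_bigram` and `perplexity` are assumptions made here for illustration, and the tokens may stand for characters, words, or frequent strings.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over token sequences (hypothetical helper)."""
    histories, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        histories.update(padded[:-1])                 # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # adjacent token pairs
    # Possible next tokens: all observed tokens plus the end marker.
    vocab = {t for s in sentences for t in s} | {"</s>"}
    return histories, bigrams, vocab

def perplexity(sentences, histories, bigrams, vocab):
    """Per-token perplexity under an add-one smoothed bigram model (assumed smoothing)."""
    v = len(vocab)
    log_prob, n = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, cur in zip(padded[:-1], padded[1:]):
            p = (bigrams[(prev, cur)] + 1) / (histories[prev] + v)
            log_prob += math.log2(p)
            n += 1
    return 2 ** (-log_prob / n)

# Toy corpus, already segmented into strings; the segmentation is illustrative.
corpus = [["北京", "邮电", "大学"]]
histories, bigrams, vocab = train_bigram(corpus)
print(perplexity(corpus, histories, bigrams, vocab))
```

Per-token perplexities are only directly comparable between models that share the same token inventory, so a comparison across character, word, and string units would need a common normalization (for example, per character).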