DataparkSearch Engine 4.28 reference manual: The Web searching software
Chapter 7. Languages support
Chinese, Japanese, Korean and Thai writing has no spaces between the words of a phrase, unlike Western languages. Thus, when indexing documents in these languages, phrases must additionally be segmented into words.
For Japanese phrase segmentation, either ChaSen, a morphological analysis system for Japanese, or MeCab, a Japanese morphological analyser, is used. One of these systems must therefore be installed before DataparkSearch is configured and built.
To enable Japanese phrase segmentation, pass the --enable-chasen or --enable-mecab switch to configure.
For Chinese phrase segmentation, a frequency dictionary of Chinese words is used. The segmentation itself is done with a dynamic programming method that maximizes the cumulative frequency of the words produced.
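The idea of frequency-maximizing segmentation can be illustrated with a minimal Python sketch. This is not DataparkSearch's actual implementation. The function name `segment`, the `max_word_len` limit, the rule that an unknown single character counts as a word of frequency 0, and the toy frequencies are all assumptions made for the illustration.

```python
def segment(text, freq, max_word_len=4):
    """Split text into words so that the sum of word frequencies is maximal.

    freq maps known words to their frequencies (hypothetical toy data here).
    Unknown single characters are accepted with frequency 0 (an assumption),
    so a segmentation always exists.
    """
    n = len(text)
    # best[i] = (best cumulative frequency for text[:i], start of the last word)
    best = [(0, 0)] + [(-1, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                f = freq[word]
            elif end - start == 1:
                f = 0  # unknown single character
            else:
                continue  # unknown multi-character string: not a word
            score = best[start][0] + f
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back from the end of the text to recover the chosen words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


# Toy frequencies, invented for the example.
toy_freq = {"中国": 50, "人民": 40, "中国人": 30, "民": 5}
print(segment("中国人民", toy_freq))  # ['中国', '人民']
```

In this toy case the split 中国 + 人民 has cumulative frequency 90, which beats 中国人 + 民 at 35, so the first split is chosen.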
To enable Chinese phrase segmentation, enable support for Chinese charsets when configuring DataparkSearch, and specify the frequency dictionary of Chinese words with the LoadChineseList command in the indexer.conf file.
LoadChineseList [charset dictionaryfilename]
By default, the GB2312 charset and the mandarin.freq dictionary are used.
For Thai phrase segmentation, a frequency dictionary of Thai words is used. The segmentation itself is done in the same way as for Chinese.
To enable Thai phrase segmentation, specify the frequency dictionary of Thai words with the LoadThaiList command in the indexer.conf file.
LoadThaiList [charset dictionaryfilename]
By default, the tis-620 charset and the thai.freq dictionary are used.
For Korean phrase segmentation, a frequency dictionary of Korean words is used. The segmentation itself is done in the same way as for Chinese.
To enable Korean phrase segmentation, specify the frequency dictionary of Korean words with the LoadKoreanList command in the indexer.conf file.
LoadKoreanList [charset dictionaryfilename]
By default, the euc-kr charset and the korean.freq dictionary are used.
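As an illustration, an indexer.conf fragment that enables all three dictionary-based segmenters might look like the following. It simply writes out the default charset and dictionary file name given above for each command, so each line is presumably equivalent to the same command with no arguments. Substitute other values if your dictionaries use a different charset or file name.

```
LoadChineseList GB2312 mandarin.freq
LoadThaiList tis-620 thai.freq
LoadKoreanList euc-kr korean.freq
```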