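1. Create the index and design the mapping
Full pinyin and first-letter pinyin need two separate subfields. I first tried to handle both with a single field, but no combination of settings met the requirements.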
PUT aikg_test
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "fields": {
          "full_pinyin": {
            "type": "text",
            "store": false,
            "term_vector": "with_offsets",
            "analyzer": "full_pinyin_analyzer",
            "boost": 10
          },
          "first_pinyin": {
            "type": "text",
            "store": false,
            "term_vector": "with_offsets",
            "analyzer": "first_pinyin_analyzer",
            "boost": 10
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "first_pinyin_analyzer": {
          "tokenizer": "first_pinyin_letter"
        },
        "full_pinyin_analyzer": {
          "tokenizer": "full_pinyin_letter"
        }
      },
      "tokenizer": {
        "first_pinyin_letter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": false,
          "keep_none_chinese": false,
          "keep_none_chinese_in_first_letter": true,
          "none_chinese_pinyin_tokenize": false
        },
        "full_pinyin_letter": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_full_pinyin": false,
          "keep_none_chinese": true,
          "keep_none_chinese_in_first_letter": false,
          "none_chinese_pinyin_tokenize": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_in_joined_full_pinyin": true
        }
      }
    }
  }
}
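To show how the two subfields might be used together at search time, here is a minimal Python sketch that builds a search body with one `prefix` clause per subfield. The function name `build_name_prefix_query` is hypothetical, and the choice of a `bool`/`should` query with `prefix` clauses is an assumption; the original post does not show the query it uses.

```python
import json


def build_name_prefix_query(keyword: str) -> dict:
    """Build a search body that prefix-matches `keyword` on both pinyin subfields.

    Assumption: the index uses the mapping above, so the subfields are
    name.full_pinyin and name.first_pinyin.
    """
    subfields = ["name.full_pinyin", "name.first_pinyin"]
    return {
        "query": {
            "bool": {
                # One prefix clause per subfield; a hit on either one is enough.
                "should": [{"prefix": {field: {"value": keyword}}} for field in subfields],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted after "GET aikg_test/_search".
    print(json.dumps(build_name_prefix_query("ldh"), indent=2))
```

The printed JSON can be sent as the body of `GET aikg_test/_search`. Keeping the two clauses separate means a user can type either the full pinyin or the initials without the application having to guess which one was intended.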
2. Analysis examples
2.1 Full-pinyin analysis
Tokenizer settings:
"full_pinyin_letter": {
  "type": "pinyin",
  "keep_first_letter": false,
  "keep_full_pinyin": false,
  "keep_none_chinese": true,
  "keep_none_chinese_in_first_letter": false,
  "none_chinese_pinyin_tokenize": false,
  "keep_joined_full_pinyin": true,
  "keep_none_chinese_in_joined_full_pinyin": true
}
Analyze request:
GET aikg_test/_analyze
{
  "text": ["刘德华at2016"],
  "analyzer": "full_pinyin_analyzer"
}
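Analysis result:
Key parameters: "keep_joined_full_pinyin": true and "keep_none_chinese_in_joined_full_pinyin": true. The former joins the full pinyin of the Chinese characters into a single token; the latter also joins the non-Chinese characters into that same token. Note that "keep_full_pinyin" is set to false.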
2.2 First-letter analysis
Tokenizer settings:
"first_pinyin_letter": {
  "type": "pinyin",
  "keep_first_letter": true,
  "keep_full_pinyin": false,
  "keep_none_chinese": false,
  "keep_none_chinese_in_first_letter": true,
  "none_chinese_pinyin_tokenize": false
}
Analyze request:
GET aikg_test/_analyze
{
  "text": ["刘德华at2016"],
  "analyzer": "first_pinyin_analyzer"
}
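Analysis result:
Key parameter: "keep_none_chinese": false. If it is set to true, "刘德华at2016" is split into two tokens, with the non-Chinese part emitted as a token of its own. A prefix search for "at" would then match this name, even though the name does not start with "at". The analysis result in that case is shown in the figure below:
With "keep_none_chinese_in_first_letter": true, the first letters of the Chinese characters are joined with the other characters into a single token.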
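The false-match problem can be illustrated without a running cluster. The sketch below mimics a prefix query in plain Python: a document matches if any of its indexed tokens starts with the typed prefix. The two token lists are illustrative assumptions based on the description above, not captured output from the pinyin plugin, and `prefix_hits` is a hypothetical helper.

```python
def prefix_hits(tokens: list[str], prefix: str) -> list[str]:
    """Return the tokens that a prefix query for `prefix` would match."""
    return [token for token in tokens if token.startswith(prefix)]


# Assumed tokens when the non-Chinese part is emitted as its own token.
split_tokens = ["ldh", "at2016"]
# Assumed token when initials and non-Chinese characters are joined.
joined_tokens = ["ldhat2016"]

if __name__ == "__main__":
    print(prefix_hits(split_tokens, "at"))   # ['at2016']  -> unwanted match
    print(prefix_hits(joined_tokens, "at"))  # []          -> no false match
    print(prefix_hits(joined_tokens, "ldh")) # ['ldhat2016']
```

With a single joined token, only prefixes of the whole name can match, which is the behavior the settings above aim for.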
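3. Case sensitivity
An uppercase query such as "LDH" does not match 刘德华. The fix is simple: convert the query string to lowercase in the application before searching.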
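A minimal sketch of that normalization step, assuming the application is written in Python and that the keyword arrives as a raw user-typed string. The helper name `normalize_keyword` is hypothetical, and trimming surrounding whitespace is an extra assumption that the original does not mention.

```python
def normalize_keyword(raw: str) -> str:
    """Lowercase the search keyword (and trim whitespace) before it is sent to Elasticsearch."""
    return raw.strip().lower()


if __name__ == "__main__":
    print(normalize_keyword("LDH"))         # ldh
    print(normalize_keyword(" LiuDeHua "))  # liudehua
```

Applying the helper at the single point where the query body is built keeps every search path consistent, so callers do not each need to remember to lowercase their input.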