Full-Text Search -- ES -- IK Analysis Plugin (Part 4)

I. The IK Analysis Plugin

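By default, Elasticsearch's standard analyzer splits Chinese text into single characters, which gives poor search results.
  elasticsearch-analysis-ik is a plugin that integrates the Lucene IK analyzer into Elasticsearch, and it supports custom dictionaries.
Repository: https://github.com/medcl/elasticsearch-analysis-ik
Releases:
https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.6.2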

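  1. Installation
    Note: IK is a third-party plugin and is not bundled with Elasticsearch, so it has to be installed separately (this also applies to 7.x releases such as 7.3).
# Install the plugin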
[root@localhost elasticsearch-6.4.3]# ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.3/elasticsearch-analysis-ik-6.4.3.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.3/elasticsearch-analysis-ik-6.4.3.zip
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-ik
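# Installation complete
[root@localhost ~]# ls /usr/local/elasticsearch-6.4.3/plugins/
analysis-ik
# Edit the configuration
[es@localhost elasticsearch-6.4.3]$ vi config/elasticsearch.yml
# Content to add

# Restart elasticsearch
Installation succeeded.

Note: the IK plugin version must match the ES version exactly.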
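
Because the plugin and ES versions have to match, the download URL can be derived from the ES version instead of being typed by hand. The following is a minimal sketch: the helper name `ik_plugin_url` is hypothetical, and the URL pattern is only assumed from the 6.4.3 link used above, so it may not hold for every release.

```python
def ik_plugin_url(es_version: str) -> str:
    """Build the IK release URL for the given ES version.

    Assumption: releases follow the same naming pattern as the
    v6.4.3 download link used in the install command above.
    """
    parts = es_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '6.4.3', got {es_version!r}")
    base = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"
    return f"{base}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


if __name__ == "__main__":
    # The result can be passed to ./bin/elasticsearch-plugin install
    print(ik_plugin_url("6.4.3"))
```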

  2. Fixing the java.net.SocketPermission error
# Create the file socketPolicy.policy in the config directory
[root@localhost config]# vi socketPolicy.policy
# Contents
grant {
    permission java.net.SocketPermission "*:*","connect,resolve";
};

# Append the following line to the end of config/jvm.options
[root@localhost elasticsearch-6.4.3]# vi config/jvm.options 
-Djava.security.policy=/usr/local/elasticsearch-6.4.3/config/socketPolicy.policy

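  3. Testing IK tokenization
    IK provides two analyzers; ik_max_word is the usual choice.
    ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination;
    ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.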
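
To make the difference in granularity concrete, here is a toy sketch. It is not IK's actual algorithm: the dictionary is a hypothetical hand-picked subset, the fine-grained mode simply emits every dictionary word found at any position, and the coarse mode uses greedy forward maximum matching as a rough stand-in.

```python
# Hypothetical mini dictionary; the real IK dictionary is far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def finest_grained(text, dictionary):
    """Emit every dictionary word found at any position (ik_max_word-like)."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first so longer words are listed first.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarsest_grained(text, dictionary):
    """Greedy forward maximum matching (a rough stand-in for ik_smart)."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # unknown character: emit it on its own
        tokens.append(text[start:end])
        start = end
    return tokens


if __name__ == "__main__":
    text = "中华人民共和国国歌"
    print(finest_grained(text, DICTIONARY))
    print(coarsest_grained(text, DICTIONARY))
```

The fine-grained output overlaps heavily, which helps recall at index time, while the coarse output yields one non-overlapping segmentation.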

Tokenization test:

GET /test_index/_analyze 
{
    "text": "安徽高校新设殡葬专业上热搜 校方称学生刚入学就被 预定",
    "analyzer": "ik_max_word"
}

Result:

{
  "tokens": [
    {
      "token": "安徽",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "高校",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "新设",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "殡葬",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "专业",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "上",
      "start_offset": 10,
      "end_offset": 11,
      "type": "CN_CHAR",
      "position": 5
    },
    {
      "token": "热",
      "start_offset": 11,
      "end_offset": 12,
      "type": "CN_CHAR",
      "position": 6
    },
    {
      "token": "搜",
      "start_offset": 12,
      "end_offset": 13,
      "type": "CN_CHAR",
      "position": 7
    },
    {
      "token": "校方",
      "start_offset": 14,
      "end_offset": 16,
      "type": "CN_WORD",
      "position": 8
    },
    {
      "token": "称",
      "start_offset": 16,
      "end_offset": 17,
      "type": "CN_CHAR",
      "position": 9
    },
    {
      "token": "学生",
      "start_offset": 17,
      "end_offset": 19,
      "type": "CN_WORD",
      "position": 10
    },
    {
      "token": "刚",
      "start_offset": 19,
      "end_offset": 20,
      "type": "CN_CHAR",
      "position": 11
    },
    {
      "token": "入学",
      "start_offset": 20,
      "end_offset": 22,
      "type": "CN_WORD",
      "position": 12
    },
    {
      "token": "就被",
      "start_offset": 22,
      "end_offset": 24,
      "type": "CN_WORD",
      "position": 13
    },
    {
      "token": "预定",
      "start_offset": 25,
      "end_offset": 27,
      "type": "CN_WORD",
      "position": 14
    }
  ]
}
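
In the result above, "上", "热" and "搜" come back as separate CN_CHAR tokens, which suggests that "热搜" is not in the dictionary. Since IK supports custom dictionaries, such runs are natural candidates for new entries. The sketch below (the helper name is hypothetical, and the sample is a shortened excerpt of the response above) collects runs of adjacent single-character tokens from an `_analyze` response.

```python
def adjacent_single_char_runs(tokens):
    """Return joined runs of two or more adjacent CN_CHAR tokens."""
    runs, current = [], []
    for tok in tokens:
        is_char = tok["type"] == "CN_CHAR"
        # A token continues the run if it starts where the previous one ended.
        contiguous = bool(current) and current[-1]["end_offset"] == tok["start_offset"]
        if is_char and (not current or contiguous):
            current.append(tok)
        else:
            if len(current) > 1:
                runs.append("".join(t["token"] for t in current))
            current = [tok] if is_char else []
    if len(current) > 1:
        runs.append("".join(t["token"] for t in current))
    return runs


# Excerpt of the _analyze response shown above.
sample_tokens = [
    {"token": "专业", "start_offset": 8, "end_offset": 10, "type": "CN_WORD"},
    {"token": "上", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR"},
    {"token": "热", "start_offset": 11, "end_offset": 12, "type": "CN_CHAR"},
    {"token": "搜", "start_offset": 12, "end_offset": 13, "type": "CN_CHAR"},
    {"token": "校方", "start_offset": 14, "end_offset": 16, "type": "CN_WORD"},
    {"token": "称", "start_offset": 16, "end_offset": 17, "type": "CN_CHAR"},
    {"token": "学生", "start_offset": 17, "end_offset": 19, "type": "CN_WORD"},
]

if __name__ == "__main__":
    print(adjacent_single_char_runs(sample_tokens))
```

A run is only a hint: here the useful dictionary entry would be "热搜", not the whole run, so the candidates still need a human review.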
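  4. ES Chinese tokenization example
    English analysis: "analyzer": "english"
    Chinese analysis: "analyzer": "ik_max_word"

Create an index named test_index with the type test_type and two fields, id and content; content is analyzed with IK (ik_max_word at index time, ik_smart at search time).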

PUT  /test_index
{
    "mappings": {
        "test_type": {
            "properties": {
                "id": {
                    "type": "long"
                },
                "content": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                }
            }
        }
    }
}
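
The mapping above uses the typed form accepted by ES 6.x. From ES 7.0 onward, index creation expects the mapping without the type level by default. As a hedged sketch (the helper name `to_typeless` is hypothetical), the same request body could be converted like this:

```python
import copy


def to_typeless(create_index_body):
    """Drop the single mapping-type level from a 6.x style create-index body."""
    body = copy.deepcopy(create_index_body)
    mappings = body.get("mappings", {})
    # A typed mapping has exactly one key (the type name) and no top-level "properties".
    if "properties" not in mappings and len(mappings) == 1:
        (_type_name, type_mapping), = mappings.items()
        body["mappings"] = type_mapping
    return body


typed_body = {
    "mappings": {
        "test_type": {
            "properties": {
                "id": {"type": "long"},
                "content": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                },
            }
        }
    }
}

if __name__ == "__main__":
    print(to_typeless(typed_body))
```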

Add data:

POST /test_index/test_type/_bulk
{"index":{}}
{"id":1,"content":"美国留给伊拉克的是个烂摊子吗"}
{"index":{}}
{"id":2,"content":"公安部:各地校车将享最高路权"}
{"index":{}}
{"id":3,"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
{"index":{}}
{"id":4,"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
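
When the bulk request is sent from a script rather than the console, the body has to be newline-delimited JSON, with one action line before each document and a trailing newline at the end. A minimal sketch of building such a body (the helper name `bulk_body` is hypothetical, and only two of the documents above are shown):

```python
import json


def bulk_body(docs):
    """Build an NDJSON _bulk body that indexes each doc with an auto-generated id."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))            # action line
        lines.append(json.dumps(doc, ensure_ascii=False))  # document source
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"id": 1, "content": "美国留给伊拉克的是个烂摊子吗"},
    {"id": 2, "content": "公安部:各地校车将享最高路权"},
]

if __name__ == "__main__":
    print(bulk_body(docs), end="")
```

Such a body is normally sent with the Content-Type header application/x-ndjson.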

Query data:

GET /test_index/_search
{
  "query" : { "match" : { "content" : "韩警" }}
}

Result:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.6462245,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "hwVVNG4B1-XKoZdra0Jm",
        "_score": 1.6462245,
        "_source": {
          "id": 3,
          "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        }
      }
    ]
  }
}

Highlighted query:

GET /test_index/_search
{
    "query" : { "match" : { "content" : "美国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "hQVVNG4B1-XKoZdra0Jm",
        "_score": 0.2876821,
        "_source": {
          "id": 1,
          "content": "美国留给伊拉克的是个烂摊子吗"
        },
        "highlight": {
          "content": [
            "<tag1>美国</tag1>留给伊拉克的是个烂摊子吗"
          ]
        }
      }
    ]
  }
}
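
On the client side, the highlighted fragments sit under each hit's `highlight` key, next to `_source`. A small sketch of pulling them out of a response (the helper name is hypothetical; the sample is a shortened copy of the response above):

```python
def highlighted_fragments(search_response, field):
    """Map each hit's _id to its highlighted fragments for the given field."""
    fragments = {}
    for hit in search_response["hits"]["hits"]:
        # Hits without a highlight for this field are skipped.
        found = hit.get("highlight", {}).get(field, [])
        if found:
            fragments[hit["_id"]] = found
    return fragments


sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_id": "hQVVNG4B1-XKoZdra0Jm",
                "_source": {"id": 1, "content": "美国留给伊拉克的是个烂摊子吗"},
                "highlight": {
                    "content": ["<tag1>美国</tag1>留给伊拉克的是个烂摊子吗"]
                },
            }
        ],
    }
}

if __name__ == "__main__":
    print(highlighted_fragments(sample_response, "content"))
```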

II. The pinyin Plugin

Releases: https://github.com/medcl/elasticsearch-analysis-pinyin/releases
Source: https://github.com/medcl/elasticsearch-analysis-pinyin#pinyin-analysis-for-elasticsearch

  1. Installation
[root@localhost elasticsearch-6.4.3]$ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v6.4.3/elasticsearch-analysis-pinyin-6.4.3.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v6.4.3/elasticsearch-analysis-pinyin-6.4.3.zip
[=================================================] 100%   
-> Installed analysis-pinyin
# List the installed plugins
[root@localhost elasticsearch-6.4.3]$ ls plugins/
analysis-ik  analysis-pinyin
[root@localhost elasticsearch-6.4.3]$ ls plugins/analysis-pinyin/
elasticsearch-analysis-pinyin-6.4.3.jar  nlp-lang-1.7.jar  plugin-descriptor.properties

# Restart ES
[es@localhost elasticsearch-6.4.3]$ ./bin/elasticsearch

  2. Custom pinyin analyzer
PUT /medcl/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                    }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}
  3. Analysis test
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}

Result:

{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "刘德华",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    }
  ]
}
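
The tokens in the result above follow from the tokenizer options: one full-pinyin token per character (keep_full_pinyin), the original text (keep_original), and the joined first letters "ldh". As a rough illustration only, here is a toy model of those options. It is not the plugin's code: the function name and the three-character PINYIN table are made up for this sketch, and keep_first_letter is assumed to default to true, as the plugin README describes.

```python
# Hypothetical lookup table covering only the characters in the example.
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def toy_pinyin_tokens(
    text,
    keep_full_pinyin=True,
    keep_original=True,
    keep_first_letter=True,          # assumed plugin default
    limit_first_letter_length=16,
    remove_duplicated_term=True,
):
    """Toy model of which tokens the pinyin tokenizer options produce."""
    syllables = [PINYIN[ch] for ch in text]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)                      # liu, de, hua
    if keep_original:
        tokens.append(text)                           # 刘德华
    if keep_first_letter:
        initials = "".join(s[0] for s in syllables)   # ldh
        tokens.append(initials[:limit_first_letter_length])
    if remove_duplicated_term:
        tokens = list(dict.fromkeys(tokens))          # drop repeats, keep order
    return tokens


if __name__ == "__main__":
    print(toy_pinyin_tokens("刘德华"))
```

The toy ignores positions and offsets; it only shows which terms end up in the index.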
  4. Example
    1) settings
PUT /pinyin_index/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                    }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}

2) Create the mapping
Index: pinyin_index
Type: test_type
Field: name

POST /pinyin_index/test_type/_mapping 
{
    "test_type": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "ik_max_word",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "pinyin_analyzer"
                    }
                }
            }
        }
    }
}

3) Add documents

POST /pinyin_index/test_type/1
{"name":"刘德华"}

POST /pinyin_index/test_type/2
{"name":"中华人民共和国国歌"}

POST /pinyin_index/_search
# Result
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "pinyin_index",
        "_type": "test_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name": "中华人民共和国国歌"
        }
      },
      {
        "_index": "pinyin_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "刘德华"
        }
      }
    ]
  }
}

4) Test

POST /pinyin_index/test_type/_search?q=name.pinyin:liu
# Result
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.3439677,
    "hits": [
      {
        "_index": "pinyin_index",
        "_type": "test_type",
        "_id": "1",
        "_score": 0.3439677,
        "_source": {
          "name": "刘德华"
        }
      }
    ]
  }
}

POST /pinyin_index/test_type/_search?q=name.pinyin:zhong
POST /pinyin_index/test_type/_search?q=name.pinyin:de
POST /pinyin_index/test_type/_search?q=name.pinyin:guoge
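
The `q=field:term` form is convenient in the console, but outside it the query string should be URL-encoded, especially when the term is Chinese. A minimal sketch using only the standard library (the helper name `uri_search_path` is hypothetical):

```python
from urllib.parse import urlencode


def uri_search_path(search_path, field, term):
    """Build a URI-search path with a properly encoded q=field:term parameter."""
    return f"{search_path}?{urlencode({'q': f'{field}:{term}'})}"


if __name__ == "__main__":
    path = "/pinyin_index/test_type/_search"
    print(uri_search_path(path, "name.pinyin", "liu"))
    print(uri_search_path(path, "name.pinyin", "国歌"))
```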
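  5. Combined IK + pinyin configuration
    1) settings
    pinyin_index already exists from the previous example, and creating an index that already exists fails, so delete it first (DELETE /pinyin_index) or use a different index name.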
PUT /pinyin_index/ 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " "
                }
            }
        }
    }
}
    2) Create the mapping
POST /pinyin_index/test_type/_mapping 
{
    "test_type": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "true",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}

3) Add documents

POST /pinyin_index/test_type/1
{"name":"刘德华"}

POST /pinyin_index/test_type/2
{"name":"中华人民共和国国歌"}
    4) Query test
POST /pinyin_index/test_type/_search?q=name.pinyin:guoge

GET /pinyin_index/test_type/_search
{
  "query": {
    "match": {
      "name.pinyin": "国歌"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}

GET /pinyin_index/test_type/_search
{
  "query": {
    "match": {
      "name.pinyin": "zhong"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}