Elasticsearch Analyzers

[TOC]

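1 Built-in analyzers

Elasticsearch ships with many built-in analyzers. If no analyzer is specified, the default is standard. It tokenizes English words correctly but handles Chinese poorly: Chinese text is split into individual characters.

1.1 Default analyzer

The results of the default standard analyzer on English and Chinese text are shown below.

1.1.1 Tokenizing English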

POST _analyze
{
  "analyzer": "standard",
  "text":"hello world"
}

The result is:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

As shown, hello world is split into hello and world, which is the expected result.

1.1.2 Tokenizing Chinese

POST _analyze
{
  "analyzer": "standard",
  "text": "少年包青天"
}

The result is:

{
  "tokens": [
    {
      "token": "少",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "年",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "包",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "青",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "天",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}

As shown, the built-in analyzer splits 少年包青天 into single characters, which is not the result we want (the two words 少年 and 包青天).
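To make the output shape above concrete, here is a toy Python sketch that imitates it: every Han ideograph becomes its own token, while runs of ASCII letters and digits stay together and are lowercased. It is only an illustration under those simplifying assumptions, not the Unicode text segmentation that the real standard analyzer performs, and the name `toy_standard_tokens` is made up for this example.

```python
import re

# One Han ideograph, or a run of ASCII letters/digits.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def toy_standard_tokens(text):
    """Imitate the token/offset/position layout of an _analyze response."""
    tokens = []
    for position, match in enumerate(_TOKEN_RE.finditer(text)):
        word = match.group()
        is_han = "\u4e00" <= word[0] <= "\u9fff"
        tokens.append({
            "token": word.lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            # Simplification: the real analyzer has more token types.
            "type": "<IDEOGRAPHIC>" if is_han else "<ALPHANUM>",
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for tok in toy_standard_tokens("少年包青天"):
        print(tok)
```

Running it on 少年包青天 yields five one-character tokens with the same offsets and positions as the response above.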

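1.2 Other built-in analyzers

To be completed.

2 Chinese analyzers

The most widely used Chinese analyzer is IK. It is a plugin and must be installed separately.

2.1 Installation

The IK version must match the Elasticsearch version. Project page: https://github.com/medcl/elasticsearch-analysis-ik

2.1.1 Online installation

(1) Install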

/opt/elasticsearch-6.8.0/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.0/elasticsearch-analysis-ik-6.8.0.zip

(2) Restart Elasticsearch

2.1.2 Local installation

(1) Download the file

cd /opt
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.0/elasticsearch-analysis-ik-6.8.0.zip
# After the download, the file is at /opt/elasticsearch-analysis-ik-6.8.0.zip

(2) Create an ik directory under the Elasticsearch plugins directory

mkdir /opt/elasticsearch-6.8.0/plugins/ik

(3) Unzip the file into the ik directory

unzip /opt/elasticsearch-analysis-ik-6.8.0.zip -d /opt/elasticsearch-6.8.0/plugins/ik

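(4) Restart Elasticsearch

2.2 IK tokenization

IK provides two analyzers, ik_smart and ik_max_word.

ik_smart produces the coarsest-grained segmentation; ik_max_word produces the finest-grained one.

ik_smart is the usual choice.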
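The difference between coarse and fine granularity can be sketched with a toy dictionary in Python. This is not IK's actual algorithm: the coarse mode below uses plain forward maximum matching as a stand-in, the fine mode simply lists every dictionary word found in the text, and the three-word dictionary and both function names are assumptions made for the illustration.

```python
def all_dictionary_words(text, dictionary):
    """Fine-grained: every dictionary word found anywhere in the text."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append((text[start:end], start, end))
    return found


def forward_max_match(text, dictionary):
    """Coarse-grained: greedy longest match, left to right, no overlaps."""
    result, start = [], 0
    while start < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                result.append((text[start:end], start, end))
                start = end
                break
    return result


if __name__ == "__main__":
    toy_dict = {"少年", "包青天", "青天"}  # hypothetical mini dictionary
    print(forward_max_match("少年包青天", toy_dict))
    print(all_dictionary_words("少年包青天", toy_dict))
```

With this toy dictionary the coarse mode returns 少年 and 包青天, and the fine mode additionally returns the overlapping word 青天.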

2.2.1 Tokenizing English

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "Hello world"
}

The result is:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "ENGLISH",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "ENGLISH",
      "position": 1
    }
  ]
}

As shown, Hello world is again split into hello and world, so English is also tokenized as expected.

2.2.2 Coarse-grained Chinese tokenization

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "少年包青天"
}

The result is:

{
  "tokens": [
    {
      "token": "少年",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "包青天",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

This is the result we want (the two words 少年 and 包青天).

2.2.3 Fine-grained Chinese tokenization

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "少年包青天"
}

The result is:

{
  "tokens": [
    {
      "token": "少年",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "包青天",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "青天",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}

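The fine-grained analyzer ik_max_word emits as many tokens as it can find in the text.

2.3 Viewing how stored data was tokenized

Suppose the following document is already stored in Elasticsearch. The remark field has type text and no analyzer was specified.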

{
  "_index": "test_index",
  "_type": "main",
  "_id": "999",
  "_version": 2,
  "found": true,
  "_source": {
    "remark": "少年包青天"
  }
}

The following request shows how Elasticsearch tokenized this document:

GET /test_index/main/999/_termvectors
{
  "fields": ["remark"]
}

Because no analyzer was specified, the default standard analyzer was used. The result is:

{
  "_index": "test_index",
  "_type": "main",
  "_id": "999",
  "_version": 2,
  "found": true,
  "took": 40,
  "term_vectors": {
    "remark": {
      "field_statistics": {
        "sum_doc_freq": 47,
        "doc_count": 7,
        "sum_ttf": 47
      },
      "terms": {
        "包": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 2,
              "end_offset": 3
            }
          ]
        },
        "天": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 4,
              "end_offset": 5
            }
          ]
        },
        "少": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "年": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 2
            }
          ]
        },
        "青": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 3,
              "end_offset": 4
            }
          ]
        }
      }
    }
  }
}
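The term vectors response lists terms in sorted term order rather than in the order they occur in the text. As a small convenience, the Python sketch below reorders them by position. The helper name `tokens_in_order` and the trimmed `SAMPLE_RESPONSE` (only the keys the helper reads) are assumptions made for this example.

```python
# Trimmed copy of the response above: only the keys the helper reads.
SAMPLE_RESPONSE = {
    "term_vectors": {
        "remark": {
            "terms": {
                "包": {"tokens": [{"position": 2}]},
                "天": {"tokens": [{"position": 4}]},
                "少": {"tokens": [{"position": 0}]},
                "年": {"tokens": [{"position": 1}]},
                "青": {"tokens": [{"position": 3}]},
            }
        }
    }
}


def tokens_in_order(termvectors_response, field):
    """Return the field's terms sorted by token position."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    positioned = [
        (tok["position"], term)
        for term, info in terms.items()
        for tok in info["tokens"]
    ]
    return [term for _, term in sorted(positioned)]


if __name__ == "__main__":
    print(tokens_in_order(SAMPLE_RESPONSE, "remark"))
```

For the response above this prints the five single characters in their original order, which makes it easy to see that the field was tokenized character by character.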

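2.4 Setting the index default analyzer

An index's default analyzer is standard. It can be changed when the index is created; any field in the mapping that does not specify an analyzer then uses the new default.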

PUT /test_ik
{
  "settings": {
    "index.analysis.analyzer.default.type": "ik_smart"
  }
}

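2.5 Setting the analyzer in a mapping

When creating a mapping manually, an analyzer can be specified for each field. It takes precedence over the index default.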

PUT /test_ik/main/_mapping
{
  "properties": {
    "name":{
      "type": "text",
      "analyzer": "ik_smart"
    }
  }
}

3 Custom tokenization rules

3.1 Configuration file

The IK configuration file is located at:

/opt/elasticsearch-6.8.0/plugins/ik/config/IKAnalyzer.cfg.xml

Its contents are:

<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionaries here -->
        <entry key="ext_dict"></entry>
        <!-- Configure your own extension stopword dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

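The extension dictionary holds custom words that you want treated as terms, for example 少年包青天 as a single word.

The extension stopword dictionary lists words that should not be emitted as tokens, for example words such as 的 and 是.

Paths in these entries can be absolute or relative. Separate multiple files with an ASCII semicolon (;).

3.2 Extension dictionary

Create the file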

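# Create the directory
mkdir /opt/elasticsearch-6.8.0/plugins/ik/config/myconf
# Create the dictionary file inside that directory
vim /opt/elasticsearch-6.8.0/plugins/ik/config/myconf/ext_dict1.dic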

The file contains:

少年包

Edit the configuration file IKAnalyzer.cfg.xml

<entry key="ext_dict">myconf/ext_dict1.dic</entry>

Restart Elasticsearch

Note: the change only takes effect after Elasticsearch is restarted.

Test the tokenization

(1) Test 1 [ik_smart]

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "少年包青天"
}

The result is:

{
  "tokens": [
    {
      "token": "少年",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "包青天",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

(2) Test 2 [ik_smart]

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "少年包测试"
}

The result is:

{
  "tokens": [
    {
      "token": "少年包",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "测试",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

(3) Test 3 [ik_max_word]

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "少年包青天"
}

The result is:

{
  "tokens": [
    {
      "token": "少年包",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "少年",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "包青天",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "青天",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

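Conclusions:

Tests (1) and (2) suggest that the coarse-grained ik_smart analyzer prefers the segmentation from IK's bundled dictionary, and uses the custom word only where the bundled dictionary offers no competing segmentation.

Test (3) shows that with the fine-grained ik_max_word analyzer, words from the bundled dictionary and from the custom dictionary are all emitted.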

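3.3 Hot-updating the IK dictionary

Hot updates require a remote HTTP service to supply the words, but do not require restarting Elasticsearch.

They are enabled through the following entries in the configuration file IKAnalyzer.cfg.xml.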

<!-- Configure a remote extension dictionary here -->
<entry key="remote_ext_dict">words_location</entry>
<!-- Configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">words_location</entry>

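The following is taken from the project's documentation:

Here words_location is a URL, for example http://yoursite.com/getCustomDict. The request only has to satisfy the following two conditions for hot updates to work.

The HTTP response must carry two headers, Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the word list again and updates its dictionary.

The response body contains one word per line, with \n as the line separator.

With these two conditions met, words are hot-updated without restarting the Elasticsearch instance.

You can keep the hot words in a UTF-8 encoded .txt file served by nginx or another simple HTTP server. When the file is modified, the server automatically returns updated Last-Modified and ETag values the next time a client requests it. A separate tool can extract relevant terms from the business system and update this .txt file.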
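As a rough illustration of such a server, the Python sketch below serves a word file using only the standard library and sets both headers itself. Several things are assumptions for the example rather than requirements of the plugin: the file name `hot_words.txt`, port 8000, deriving the ETag from a hash of the file contents, and answering HEAD as well as GET in case the plugin polls with HEAD requests.

```python
import hashlib
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

DICT_PATH = "hot_words.txt"  # hypothetical UTF-8 file, one word per line


def dict_headers(path):
    """Return (Last-Modified, ETag, body); both headers change with the file."""
    with open(path, "rb") as fh:
        body = fh.read()
    last_modified = formatdate(os.path.getmtime(path), usegmt=True)
    etag = '"%s"' % hashlib.sha256(body).hexdigest()
    return last_modified, etag, body


class DictHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        last_modified, etag, body = dict_headers(DICT_PATH)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Last-Modified", last_modified)
        self.send_header("ETag", etag)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        self._send(include_body=False)


if __name__ == "__main__":
    # Assumed port; remote_ext_dict would then point at http://<host>:8000/
    HTTPServer(("0.0.0.0", 8000), DictHandler).serve_forever()
```

In practice nginx or any static file server already produces these headers, so a hand-written server like this is mainly useful when the word list is generated dynamically.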

4 Changing the analyzer in a mapping

4.1 Checking the index's analyzer

GET /test_index/_mapping

The result is:

{
  "test_index": {
    "mappings": {
      "main": {
        "properties": {
          "remark": {
            "type": "text"
          }
        }
      }
    }
  }
}

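By default, a text field whose mapping does not specify an analyzer uses the standard analyzer, which handles Chinese poorly.

The index is already in use, and the analyzer in its mapping needs to be changed to ik_smart.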

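4.2 Procedure

The most common mistake when using Elasticsearch is creating a mapping with the wrong analyzer, which then requires changing the index mapping.

In Elasticsearch, fields can only be added to an index mapping; existing fields cannot be modified or deleted. To change the mapping you must create a new index, copy the data from the old index into it with reindex, and then delete the old index.
The main problem is that the index name changes: in production, the code must be updated to use the new name, which means stopping the service. Aliases solve this. Create an alias for every index; when an index has to change, reindex the data into the new index and then point the alias at it. The mapping is changed without downtime.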

4.2.1 The change required

Suppose the original index and mapping were created with:

# Create the index
PUT /test_re
# Create the index mapping
PUT /test_re/_mapping/myType
{
  "properties": {
    "name":{
      "type": "text"
    }
  }
}

The index mapping is:

{
  "test_re": {
    "mappings": {
      "myType": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    }
  }
}

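The name field in the current mapping has type text and therefore uses the default standard analyzer. It needs to be changed to an IK analyzer (ik_max_word in the steps below).

4.2.2 Creating an alias for the index

The alias should be created before going live, so that the index can later be swapped without downtime.

The following creates the alias test_re_alias for the index test_re: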

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_re",
        "alias": "test_re_alias"
      }
    }
  ]
}

View the newly created alias:

GET /test_re_alias

The result shows that the alias test_re_alias is now associated with the real index test_re:

{
  "test_re": {
    "aliases": {
      "test_re_alias": {}
    },
    "mappings": {
      "myType": {
        "properties": {
          "name": {
            "type": "text"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1586697159399",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "juxDk8TmQt-6gdu6iviVGg",
        "version": {
          "created": "6040099"
        },
        "provided_name": "test_re"
      }
    }
  }
}

4.2.3 Creating the new index

# Create the index
PUT /test_re1
# Create the index mapping
PUT /test_re1/_mapping/myType
{
  "properties": {
    "name":{
      "type": "text",
      "analyzer": "ik_max_word"
    }
  }
}

The new index looks like this:

{
  "test_re1": {
    "aliases": {},
    "mappings": {
      "myType": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1586697347875",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "h2Hk4mkRTyyaHuK0PKecUQ",
        "version": {
          "created": "6040099"
        },
        "provided_name": "test_re1"
      }
    }
  }
}

4.2.4 Reindexing

Reindexing synchronizes the data: the contents of the original index are written into the new index.

POST /_reindex
{
  "source": {
    "index": "test_re"
  },
  "dest": {
    "index": "test_re1"
  }
}

The result is:

{
  "took": 182,
  "timed_out": false,
  "total": 4,
  "updated": 0,
  "created": 4,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}
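Before switching the alias it is worth confirming that the reindex copied everything. The Python sketch below applies a simple heuristic to the response shown above: no timeout, no failures, no version conflicts, and as many documents written as the source contained. The helper name `reindex_succeeded` and the exact set of checks are assumptions, not an official definition of success.

```python
# Trimmed copy of the reindex response above.
SAMPLE_REINDEX_RESPONSE = {
    "timed_out": False,
    "total": 4,
    "updated": 0,
    "created": 4,
    "version_conflicts": 0,
    "failures": [],
}


def reindex_succeeded(response):
    """True when every source document was written and nothing failed."""
    written = response["created"] + response["updated"]
    return (
        not response["timed_out"]
        and not response["failures"]
        and response["version_conflicts"] == 0
        and written == response["total"]
    )


if __name__ == "__main__":
    print(reindex_succeeded(SAMPLE_REINDEX_RESPONSE))
```

A stricter check could also compare document counts of the two indices after the copy.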

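4.2.5 Pointing the alias at the new index

Once the new index is populated, the alias must be removed from the old index; otherwise a query through the alias returns documents from both indices.

Remove the alias from the old index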

POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "test_re",
        "alias": "test_re_alias"
      }
    }
  ]
}

Add the alias to the new index

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_re1",
        "alias": "test_re_alias"
      }
    }
  ]
}
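Between the two separate requests above there is a short window in which the alias points at no index. The _aliases API accepts several actions in one request and applies them atomically, so the remove and the add can be combined. The Python sketch below only builds that request body; the helper name `alias_swap_body` is made up for the example.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build one _aliases body that removes and adds the alias in a single call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = alias_swap_body("test_re_alias", "test_re", "test_re1")
    # Send this as the body of: POST /_aliases
    print(json.dumps(body, indent=2))
```

Sending the printed JSON as a single POST /_aliases request replaces the two requests above and leaves no gap for clients that query through the alias.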

View the alias:

GET /test_re_alias

The result shows that the alias now points to the new index, which uses the new analyzer ik_max_word:

{
  "test_re1": {
    "aliases": {
      "test_re_alias": {}
    },
    "mappings": {
      "myType": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1586697347875",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "h2Hk4mkRTyyaHuK0PKecUQ",
        "version": {
          "created": "6040099"
        },
        "provided_name": "test_re1"
      }
    }
  }
}

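4.2.6 Deleting the old index

DELETE /test_re

4.3 Open issues

4.3.1 Data consistency during reindexing

If data in Elasticsearch changes while the reindex and the alias switch are in progress, how can those changes be kept consistent in the new index?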