Elasticsearch Autocomplete

Technical documentation

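Installing the pinyin analysis plugin

Choose the pinyin plugin release whose version matches your Elasticsearch version.

Download and unzip it, then place the extracted folder in the plugins directory of the Elasticsearch installation.

Restart Elasticsearch so that the plugin is loaded.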

Custom analyzer

Pinyin analyzer: optional settings

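1. First-letter settings

keep_first_letter (default: true)

Meaning: whether to emit the combined first letters of each character's pinyin, which enables searching by initials
Enabled: 刘德华 → [ldh]
Disabled: 刘德华 → [] (no first-letter term is generated)
Use case: finding "刘德华" by typing "ldh"

keep_separate_first_letter (default: false)

Meaning: whether to emit the first letter of each character as a separate term
Enabled: 刘德华 → [l,d,h]
Disabled: 刘德华 → [ldh]
Note: enabling it enlarges the index and allows looser matching (for example "l d h"), but single-letter terms are very frequent, so results can become too fuzzy

limit_first_letter_length (default: 16)

Meaning: maximum length of the first-letter term
Example:
中华人民共和国 → [zhrmghg] by default (7 characters)
With a limit of 3 → [zhr]
Purpose: caps the length of the first-letter term for long text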
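The three first-letter settings above can be modeled in a few lines of Java. This is only a toy sketch, not the plugin's implementation: the class and method names are hypothetical, and the input is assumed to be already converted to pinyin syllables so that no dictionary is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the first-letter settings; not the plugin's real code. */
class FirstLetterSketch {

    // syllables: pinyin of each character, e.g. [liu, de, hua] (assumed input)
    static List<String> firstLetterTerms(List<String> syllables,
                                         boolean keepFirstLetter,
                                         boolean keepSeparateFirstLetter,
                                         int limitFirstLetterLength) {
        List<String> terms = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        for (String syllable : syllables) {
            String initial = syllable.substring(0, 1);
            if (keepSeparateFirstLetter) {
                terms.add(initial);          // one term per character
            }
            joined.append(initial);
        }
        if (keepFirstLetter && joined.length() > 0) {
            // Truncate the combined initials to the configured maximum length
            int end = Math.min(limitFirstLetterLength, joined.length());
            terms.add(joined.substring(0, end));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> name = List.of("liu", "de", "hua");
        System.out.println(firstLetterTerms(name, true, false, 16));
        System.out.println(firstLetterTerms(name, false, true, 16));
    }
}
```

Calling it with the syllables of 中华人民共和国 and a limit of 3 returns the single term zhr, matching the example above.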

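2. Full pinyin settings

keep_full_pinyin (default: true)

Meaning: whether to emit the full pinyin of each Chinese character as its own term
Enabled: 刘德华 → [liu,de,hua]
Disabled: 刘德华 → []
Why it matters: this is the basis for searching by individual pinyin syllables

keep_joined_full_pinyin (default: false)

Meaning: whether to emit the full pinyin joined into a single term
Enabled: 刘德华 → [liudehua]
Disabled: 刘德华 → no joined term (only [liu,de,hua], if keep_full_pinyin is on)
Trade-off: using the joined term alone (with keep_full_pinyin off) reduces the number of indexed terms, but searches for a single syllable no longer match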
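How the two full-pinyin settings combine can be sketched as follows. The class and method names are hypothetical, the input is assumed to be a list of pinyin syllables, and the order of the emitted terms is an assumption rather than the plugin's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of keep_full_pinyin and keep_joined_full_pinyin. */
class FullPinyinSketch {

    static List<String> fullPinyinTerms(List<String> syllables,
                                        boolean keepFullPinyin,
                                        boolean keepJoinedFullPinyin) {
        List<String> terms = new ArrayList<>();
        if (keepFullPinyin) {
            terms.addAll(syllables);                   // liu, de, hua
        }
        if (keepJoinedFullPinyin && !syllables.isEmpty()) {
            terms.add(String.join("", syllables));     // liudehua
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> name = List.of("liu", "de", "hua");
        System.out.println(fullPinyinTerms(name, true, false));
        System.out.println(fullPinyinTerms(name, false, true));
    }
}
```

With only the joined term indexed, a query for de finds nothing, which is the loss of single-character search mentioned above.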

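3. Non-Chinese text settings

keep_none_chinese (default: true)

Meaning: whether to keep non-Chinese letters and digits from the original text
Enabled: 刘德华AT2016 → [liu,de,hua,AT2016]
Disabled: 刘德华AT2016 → [liu,de,hua]
Why it matters: this is the key setting for text that mixes Chinese and non-Chinese characters

keep_none_chinese_together (default: true)

Meaning: whether to keep consecutive non-Chinese characters together as one term (requires keep_none_chinese)
Enabled: DJ音乐家 → [DJ,yin,yue,jia]
Disabled: DJ音乐家 → [D,J,yin,yue,jia]
Effect: disabling it noticeably increases the number of indexed terms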

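4. Advanced settings

none_chinese_pinyin_tokenize (default: true)

Meaning: whether to split non-Chinese letters into separate pinyin terms when they spell valid pinyin (requires keep_none_chinese and keep_none_chinese_together)
Enabled: liudehua2016 → [liu,de,hua,2016]
Disabled: liudehua2016 → [liudehua2016]
Use case: input that mixes typed pinyin with digits

remove_duplicated_term (default: false)

Meaning: whether to drop duplicate terms
Enabled: de的 → [de]
Disabled: de的 → [de,de]
Trade-off: can save an estimated 30-50% of index space, but may affect position-dependent queries and highlight accuracy

keep_original (default: false)

Meaning: whether to also keep the original text as a term
Enabled (with keep_joined_full_pinyin on): "北京" → ["北京", "beijing", "bj"]
Disabled: "北京" → ["beijing", "bj"]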
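The effect of remove_duplicated_term can be pictured as order-preserving deduplication of the emitted terms. The sketch below is a simplification under that assumption; the class and method names are hypothetical, and the exact scope in which the plugin deduplicates is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Toy model of remove_duplicated_term: drop repeated terms, keep first-seen order. */
class DedupSketch {

    static List<String> emit(List<String> terms, boolean removeDuplicatedTerm) {
        if (!removeDuplicatedTerm) {
            return new ArrayList<>(terms);
        }
        // LinkedHashSet keeps the first occurrence of each term in order
        return new ArrayList<>(new LinkedHashSet<>(terms));
    }

    public static void main(String[] args) {
        // "de" typed in Latin letters followed by a character whose pinyin is also "de"
        List<String> terms = List.of("de", "de");
        System.out.println(emit(terms, true));
        System.out.println(emit(terms, false));
    }
}
```

Dropping the second occurrence is also why position-dependent features can suffer: the term that would have carried the second position is no longer in the index.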

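5. System behavior settings

ignore_pinyin_offset (default: true)

Meaning: whether to ignore the offsets of pinyin terms
Enabled: overlapping terms are allowed, but position-dependent queries and highlighting become inaccurate
Disabled: offsets are strictly enforced, which keeps highlighting accurate
Version note: this matters from Elasticsearch 6.0 onward, where offsets are strictly constrained and overlapping tokens are otherwise not allowed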

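How a custom analyzer works

An Elasticsearch analyzer consists of three parts:

character filter: processes the text before the tokenizer, for example by deleting or replacing characters
tokenizer: splits the text into terms according to some rule, for example keyword (no splitting at all) or ik_smart
token filter: post-processes the terms produced by the tokenizer, for example case conversion, synonym handling, or pinyin conversion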
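The three stages run in a fixed order, which the following sketch makes explicit. It is a conceptual model only: the class name, the punctuation-stripping character filter, the whitespace tokenizer, and the lowercase token filter are stand-ins chosen for illustration, not Elasticsearch classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Conceptual model of an analyzer: character filter -> tokenizer -> token filter. */
class AnalyzerPipelineSketch {

    static List<String> analyze(String text,
                                Function<String, String> charFilter,
                                Function<String, List<String>> tokenizer,
                                Function<List<String>, List<String>> tokenFilter) {
        String filtered = charFilter.apply(text);       // 1. clean the raw text
        List<String> tokens = tokenizer.apply(filtered); // 2. split it into terms
        return tokenFilter.apply(tokens);                // 3. post-process each term
    }

    static List<String> demo(String text) {
        Function<String, String> stripPunctuation = s -> s.replaceAll("[,.!?]", "");
        Function<String, List<String>> whitespace = s -> Arrays.asList(s.trim().split("\\s+"));
        Function<List<String>, List<String>> lowercase = tokens -> tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
        return analyze(text, stripPunctuation, whitespace, lowercase);
    }

    public static void main(String[] args) {
        System.out.println(demo("Hello, ES World!"));
    }
}
```

In a custom pinyin analyzer the tokenizer slot is filled by a tokenizer such as ik_max_word and the token filter slot by the pinyin filter.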

Example

Create an index named test for trying out the custom analyzer

The py filter below drops the per-character pinyin (keep_full_pinyin: false), emits the pinyin joined into one term (keep_joined_full_pinyin: true), keeps the original text (keep_original: true), caps the first-letter term at 16 characters (limit_first_letter_length: 16), removes duplicate terms (remove_duplicated_term: true), and does not split non-Chinese text by pinyin rules (none_chinese_pinyin_tokenize: false).

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["py"]
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "words": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

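Documents are analyzed with my_analyzer when the inverted index is built.

Queries are analyzed with ik_max_word, as set by search_analyzer.

Because the query text is not converted to pinyin, a search for "狮子" (lion) no longer returns documents about "虱子" (lice), even though both words are pronounced shizi.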
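The difference between analyzing the query with and without the pinyin filter can be reproduced with a tiny in-memory inverted index. Everything here is a hypothetical model: the English words lion and louse stand in for the two Chinese words, the romanization table is hard-coded, and no Elasticsearch API is involved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index showing why the search analyzer should skip pinyin. */
class SearchAnalyzerSketch {

    // Hypothetical table: both words share the same pinyin, "shizi"
    static final Map<String, String> PINYIN = Map.of("lion", "shizi", "louse", "shizi");

    // Index-time analysis: keep the original word and add its pinyin
    static List<String> withPinyin(String word) {
        return List.of(word, PINYIN.get(word));
    }

    // Search-time analysis without the pinyin filter: the word only
    static List<String> plain(String word) {
        return List.of(word);
    }

    // Inverted index: term -> ids of the documents that contain it
    static Map<String, Set<Integer>> buildIndex(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new LinkedHashMap<>();
        docs.forEach((id, word) -> {
            for (String term : withPinyin(word)) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        });
        return index;
    }

    // A document matches if it contains any of the query terms
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(index.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(Map.of(1, "louse", 2, "lion"));
        System.out.println(search(index, withPinyin("louse"))); // query also expanded to pinyin
        System.out.println(search(index, plain("louse")));      // query kept as typed
    }
}
```

When the query is expanded to pinyin it matches both documents through the shared term shizi; kept as typed, it matches only document 1.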

Test

POST /test/_analyze
{
  "text": ["了却君王天下事junwang天下事"],
  "analyzer": "my_analyzer"
}

{
  "tokens" : [
    { "token" : "了却", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "leque", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "lq", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "君王", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "junwang", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "jw", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "天下事", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 },
    { "token" : "tianxiashi", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 },
    { "token" : "txs", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 },
    { "token" : "天下", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "tianxia", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "tx", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "事", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "shi", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "s", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "junwang", "start_offset" : 7, "end_offset" : 14, "type" : "ENGLISH", "position" : 5 },
    { "token" : "天下事", "start_offset" : 14, "end_offset" : 17, "type" : "CN_WORD", "position" : 6 },
    { "token" : "tianxiashi", "start_offset" : 14, "end_offset" : 17, "type" : "CN_WORD", "position" : 6 },
    { "token" : "txs", "start_offset" : 14, "end_offset" : 17, "type" : "CN_WORD", "position" : 6 },
    { "token" : "天下", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 7 },
    { "token" : "tianxia", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 7 },
    { "token" : "tx", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 7 },
    { "token" : "事", "start_offset" : 16, "end_offset" : 17, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "shi", "start_offset" : 16, "end_offset" : 17, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "s", "start_offset" : 16, "end_offset" : 17, "type" : "CN_CHAR", "position" : 8 }
  ]
}

PUT /test/_doc/1
{
  "words": "身上有虱子"
}

PUT /test/_doc/2
{
  "words": "山里有狮子"
}

Run the query

GET /test/_search
{
  "query": {
    "match": {
      "words": "虱子"
    }
  }
}

Result when the mapping does not set search_analyzer to ik_max_word (the query is then analyzed with my_analyzer as well, so "虱子" is also expanded to shizi)

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 0.33425623,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.33425623,
        "_source" : { "words" : "身上有虱子" }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.3085442,
        "_source" : { "words" : "山里有狮子" }
      }
    ]
  }
}

Result after search_analyzer is set to ik_max_word

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.9530773,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9530773,
        "_source" : { "words" : "身上有虱子" }
      }
    ]
  }
}

Clearly, the second result is the one we want.

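Autocomplete

Elasticsearch provides the completion suggester for autocomplete. It matches the terms that begin with the user's input and returns them.

The field used for completion must be of type completion, and its value holds the terms offered as completions, usually several per document.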
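The matching rule described above, returning the entries that begin with the user's input, can be sketched as a plain prefix filter. This is only a conceptual model with hypothetical names: Elasticsearch stores completion entries in an in-memory structure optimized for prefix lookup rather than scanning a list, and the entries here are written in pinyin to keep the example ASCII-only.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Conceptual model of prefix-based completion over a list of entries. */
class CompletionSketch {

    static List<String> suggest(List<String> entries, String prefix, int size) {
        return entries.stream()
                .filter(entry -> entry.startsWith(prefix)) // match from the start only
                .limit(size)                               // at most `size` options
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> entries = List.of("mihayou", "moba", "shouyou");
        System.out.println(suggest(entries, "mi", 5));
        System.out.println(suggest(entries, "ha", 5));
    }
}
```

An input that occurs only in the middle of an entry, such as ha inside mihayou, produces no suggestion under this rule.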

Autocomplete (DSL implementation)

Create a game index that contains a single field of type completion, named title

PUT /game
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["py"]
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "completion",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

POST /game/_bulk
{"index":{"_id":1}}
{"title":["原神","开放世界","角色扮演","动作冒险","多平台","米哈游"]}
{"index":{"_id":2}}
{"title":["王者荣耀","MOBA","5v5","竞技","手游"]}
{"index":{"_id":3}}
{"title":["绝地求生","大逃杀","FPS","射击","Steam"]}
{"index":{"_id":4}}
{"title":["英雄联盟","MOBA","PC","竞技","团队合作"]}
{"index":{"_id":5}}
{"title":["崩坏：星穹铁道","角色扮演","回合制","科幻","米哈游"]}

Test case 1

GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "mi",
      "completion": {
        "field": "title",
        "skip_duplicates": false,
        "size": 5
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "game_suggest" : [
      {
        "text" : "mi",
        "offset" : 0,
        "length" : 2,
        "options" : [
          {
            "text" : "米哈游",
            "_index" : "game",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "title" : [ "原神", "开放世界", "角色扮演", "动作冒险", "多平台", "米哈游" ]
            }
          },
          {
            "text" : "米哈游",
            "_index" : "game",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 1.0,
            "_source" : {
              "title" : [ "崩坏：星穹铁道", "角色扮演", "回合制", "科幻", "米哈游" ]
            }
          }
        ]
      }
    ]
  }
}

Test case 2

GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "ha",
      "completion": {
        "field": "title",
        "skip_duplicates": false,
        "size": 5
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "game_suggest" : [
      {
        "text" : "ha",
        "offset" : 0,
        "length" : 2,
        "options" : [ ]
      }
    ]
  }
}

No options come back here: completion matches only from the beginning of an entry, and "ha" appears only in the middle of "米哈游" (mihayou).

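Autocomplete with the Java REST client

import java.util.Map;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.junit.jupiter.api.Test;

// Method of a test class; client is an already initialized RestHighLevelClient field
@Test
void testSuggest() throws Exception {
    // Build a suggest request against the game index
    SearchRequest request = new SearchRequest("game");
    request.source().suggest(new SuggestBuilder()
            .addSuggestion("game_suggest",
                    SuggestBuilders.completionSuggestion("title")
                            .prefix("mi")
                            .skipDuplicates(false)
                            .size(5)));

    SearchResponse response = client.search(request, RequestOptions.DEFAULT);

    // Look up the suggestion by the name used in the request
    CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("game_suggest");
    for (CompletionSuggestion.Entry entry : completionSuggestion.getEntries()) {
        for (CompletionSuggestion.Entry.Option option : entry) {
            // Completion text
            String suggestedText = option.getText().string();
            // _source of the associated document (if any)
            Map<String, Object> source = option.getHit().getSourceAsMap();
            System.out.println("Hit: " + suggestedText);
            System.out.println("Document: " + source);
        }
    }
}

Hit: 米哈游
Document: {title=[原神, 开放世界, 角色扮演, 动作冒险, 多平台, 米哈游]}
Hit: 米哈游
Document: {title=[崩坏：星穹铁道, 角色扮演, 回合制, 科幻, 米哈游]}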
