
Elasticsearch (Part 2)

1. analysis and analyzer

analysis, or text analysis, is the process of converting full text into a series of terms, also known as tokenization. It is carried out by an analyzer: you can use one of Elasticsearch's built-in analyzers or define custom ones. Terms are not only converted when data is written; the same analyzer must also be applied to the query string at search time.

An analyzer is made up of three parts. Take the following text as an example:

Hello a World, the world is beautiful

1. Character Filter: strips markup such as HTML tags from the text.
2. Tokenizer: splits the text into tokens according to rules; for English this is typically on whitespace.
3. Token Filter: removes stop words (a, an, the, is, ...) and lowercases the remaining tokens.

(Figure: the analyzer pipeline: Character Filter, then Tokenizer, then Token Filter. Source image: images2/analysis.png)
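
The pipeline can be reproduced with the _analyze API by naming each stage explicitly. The combination below (html_strip character filter, standard tokenizer, lowercase and stop token filters) is only an illustration built from built-in components, not a required configuration:

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>Hello a World, the world is beautiful</p>"
}

The response should contain only hello, world, world and beautiful; the markup and the stop words a, the, is are gone.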

1.1 Built-in analyzers

| Analyzer | Behavior |
| --- | --- |
| Standard Analyzer | The default; splits on word boundaries and lowercases tokens |
| Simple Analyzer | Splits on non-letter characters (symbols are discarded) and lowercases |
| Stop Analyzer | Lowercases and removes stop words (the, a, this, ...) |
| Whitespace Analyzer | Splits on whitespace; does not lowercase |
| Keyword Analyzer | No tokenization; the input is emitted as a single token |
| Pattern Analyzer | Splits on a regular expression, \W+ (non-word characters) by default |

1.2 Built-in analyzer examples

A. Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

B. Simple Analyzer

GET _analyze
{
  "analyzer": "simple",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

C. Stop Analyzer

GET _analyze
{
  "analyzer": "stop",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

D. Whitespace Analyzer

GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

E. Keyword Analyzer

GET _analyze
{
  "analyzer": "keyword",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

F. Pattern Analyzer

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

1.3 Chinese word segmentation

Chinese word segmentation is a hard problem for every search engine. A Chinese sentence has to be split into individual words, and the same characters can be segmented differently depending on context. For example:

这个苹果,不大好吃 / 这个苹果,不大,好吃 ("this apple is not very tasty" vs. "this apple is small, and tasty")
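
For comparison, the built-in standard analyzer has no notion of Chinese words and simply emits one token per character, which is exactly why a dedicated Chinese analyzer is needed. A quick check, using the sentence above:

GET _analyze
{
  "analyzer": "standard",
  "text": "这个苹果,不大好吃"
}
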
1.3.1 The IK analyzer

The IK analyzer supports custom dictionaries and hot reloading of the segmentation dictionary; the project lives at https://github.com/medcl/elasticsearch-analysis-ik. It can be installed in one step with the plugin tool:

elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

Or install it manually:

  1. Download the zip package from https://github.com/medcl/elasticsearch-analysis-ik/releases
  2. Create a directory named analysis-ik under Elasticsearch's plugins directory and unzip the package into it
  3. From a command prompt in Elasticsearch's bin directory, run elasticsearch-plugin.bat list; the plugin should show up in the output

The IK plugin provides the following analyzers (a quick comparison follows the list):

  • ik_smart
  • ik_max_word
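
The two differ in granularity: ik_smart produces the coarsest split, while ik_max_word emits every word the dictionary can find. A quick side-by-side sketch (the sample sentence is the one used in the plugin's README; the exact tokens depend on the bundled dictionary):

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
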
1.3.2 HanLP

Installation steps:

  1. Download the zip package from https://pan.baidu.com/s/1mFPNJXgiTPzZeqEjH_zifw#list/path=%2F, password i0o7
  2. Create a directory named analysis-hanlp under Elasticsearch's plugins directory and unzip the package into it.
  3. Download the dictionary from https://github.com/hankcs/HanLP/releases
  4. Delete the data directory under analysis-hanlp, then unzip the dictionary into the analysis-hanlp directory

HanLP provides the following analyzers:

  • hanlp, the default analyzer
  • hanlp_standard, standard segmentation
  • hanlp_index, index-oriented segmentation
  • hanlp_nlp, NLP segmentation
  • hanlp_n_short, N-shortest-path segmentation
  • hanlp_dijkstra, shortest-path segmentation
  • hanlp_speed, extreme-speed dictionary segmentation
1.3.3 The pinyin analyzer

Installation steps:

  1. Download the zip package from https://github.com/medcl/elasticsearch-analysis-pinyin/releases
  2. Create a directory named analyzer-pinyin under Elasticsearch's plugins directory and unzip the package into it.
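
After restarting Elasticsearch, a quick way to confirm the plugin is loaded is to call the pinyin analyzer it registers (the sample text here is arbitrary):

GET _analyze
{
  "analyzer": "pinyin",
  "text": "刘德华"
}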

1.4 Chinese segmentation examples

ik_smart

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}

hanlp

GET _analyze
{
  "analyzer": "hanlp",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}

hanlp_standard

GET _analyze
{
  "analyzer": "hanlp_standard",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}

hanlp_speed

GET _analyze
{
  "analyzer": "hanlp_speed",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}

1.5 Using analyzers in practice

Having listed so many analyzers, how are they actually used?

1.5.1 Define the mapping

To use an analyzer, first specify which analyzer applies to which field, as shown below:

PUT customers
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_standard"
      }
    }
  }
}
1.5.2 Insert data

POST customers/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}

1.5.3 Query

GET customers/_search
{
  "query": {
    "match": {
      "content": "密码"
    }
  }
}

1.6 Searching with pinyin

We may want to search by pinyin. The pinyin plugin was introduced above among the Chinese analyzers; how is it used in practice?

1.6.1 Define the settings
PUT /medcl
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}

As shown above, we build a custom analyzer named pinyin_analyzer on top of the existing pinyin tokenizer. The available parameters are documented at https://github.com/medcl/elasticsearch-analysis-pinyin
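
Before wiring the analyzer into a mapping, it can be tested directly against the index. With the settings above, the output should contain the full pinyin syllables, the original term (keep_original) and the first-letter abbreviation:

GET medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}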

1.6.2 Define the mapping

PUT medcl/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}

1.6.3 Insert data

POST medcl/_bulk
{"index":{}}
{"name": "刘德华"}
{"index":{}}
{"name": "张学友"}
{"index":{}}
{"name": "四大天王"}
{"index":{}}
{"name": "柳岩"}
{"index":{}}
{"name": "angel baby"}

1.6.4 Query

GET medcl/_search
{
  "query": {
    "match": {
      "name.pinyin": "ldh"
    }
  }
}

1.7 Mixed Chinese and pinyin search

1.7.1 Define the settings

PUT goods
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_standard_pinyin": {
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["my_pinyin"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}

1.7.2 Define the mapping

PUT goods/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "hanlp_standard_pinyin"
    }
  }
}

1.7.3 Insert data

POST goods/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}

1.7.4 Query

GET goods/_search
{
  "query": {
    "match": {
      "content": "caozuo"
    }
  },
  "highlight": {
    "pre_tags": "",
    "post_tags": "",
    "fields": {
      "content": {}
    }
  }
}

2. Integrating Spring Boot with Elasticsearch

2.1 Add the dependency

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>

2.2 Configuration

spring:
  elasticsearch:
    rest:
      uris: http://localhost:9200

2.3 Obtaining an ElasticsearchTemplate

@Configuration
public class ElasticsearchConfig extends ElasticsearchConfigurationSupport {

    @Bean
    public Client elasticsearchClient() throws UnknownHostException {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-application")
                .build();
        TransportClient client = new PreBuiltTransportClient(settings);
        client.addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300));
        return client;
    }

    @Bean(name = {"elasticsearchOperations", "elasticsearchTemplate"})
    public ElasticsearchTemplate elasticsearchTemplate() throws UnknownHostException {
        return new ElasticsearchTemplate(elasticsearchClient(), entityMapper());
    }

    // use the ElasticsearchEntityMapper
    @Bean
    @Override
    public EntityMapper entityMapper() {
        ElasticsearchEntityMapper entityMapper = new ElasticsearchEntityMapper(
                elasticsearchMappingContext(), new DefaultConversionService());
        entityMapper.setConversions(elasticsearchCustomConversions());
        return entityMapper;
    }
}

2.4 Defining the POJO

@Document(indexName = "movies", type = "_doc")
public class Movie {

    private String id;
    private String title;
    private Integer year;
    private List<String> genre;

    // setters and getters
}

2.5 Queries

A. Paged query

// paged query
@RequestMapping("/page")
public Object pageQuery(
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "1") Integer page) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withPageable(PageRequest.of(page, size))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}

B. Range query

// single-condition range query: all movies released between 2016 and 2018
@RequestMapping("/range")
public Object rangeQuery() {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new RangeQueryBuilder("year").from(2016).to(2018))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

C. Match query

// single-condition match query: matches documents containing any of the analyzed terms
@RequestMapping("/match")
public Object singleCriteriaQuery(String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new MatchQueryBuilder("title", searchText))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

D. Multi-condition paged query

@RequestMapping("/match/multiple")
public Object multiplePageQuery(
        @RequestParam(required = true) String searchText,
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "1") Integer page) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new BoolQueryBuilder()
                    .must(new MatchQueryBuilder("title", searchText))
                    .must(new RangeQueryBuilder("year").from(2016).to(2018)))
            .withPageable(PageRequest.of(page, size))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}

E. Multi-condition OR query

// multi-condition OR query: either condition may match
@RequestMapping("/match/or/multiple")
public Object multipleOrQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new BoolQueryBuilder()
                    .should(new MatchQueryBuilder("title", searchText))
                    .should(new RangeQueryBuilder("year").from(2016).to(2018)))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}

F. Exact match on a single term

// documents whose title contains the given term; the search text must be a single word
@RequestMapping("/term")
public Object termQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new TermQueryBuilder("title", searchText))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

Exact match on multiple terms

// documents whose title contains any of the given words
@RequestMapping("/terms")
public Object termsQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new TermsQueryBuilder("title", searchText.split("\\s+")))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

G. Phrase match

@RequestMapping("/phrase")
public Object phraseQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new MatchPhraseQueryBuilder("title", searchText))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

H. Returning only selected fields

@RequestMapping("/source")
public Object sourceQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withSourceFilter(new FetchSourceFilter(
                    new String[]{"title", "year", "id"}, new String[]{}))
            .withQuery(new MatchPhraseQueryBuilder("title", searchText))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

I. Multi-field match

@RequestMapping("/multiple/field")
public Object allTermsQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new MultiMatchQueryBuilder(searchText, "title", "genre")
                    .type(MultiMatchQueryBuilder.Type.MOST_FIELDS))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

J. Requiring all terms to be present

// all of the given words must be present in the title
@RequestMapping("/also/include")
public Object alsoInclude(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new QueryStringQueryBuilder(searchText)
                    .field("title").defaultOperator(Operator.AND))
            .build();
    List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    return movies;
}

3. Importing MySQL data with Logstash

input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/es?useSSL=false&serverTimezone=UTC"
    jdbc_user => "es"
    jdbc_password => "123456"
    # enable column tracking; if true, tracking_column must be specified
    use_column_value => false
    # the column to track
    tracking_column => "id"
    # type of the tracked column, either numeric or timestamp; numeric is the default
    tracking_column_type => "numeric"
    # record the state of the last run
    record_last_run => true
    # where the last-run state is stored
    last_run_metadata_path => "mysql-position.txt"
    statement => "SELECT * FROM news where tags is not null"
    # run every day at 17:57
    schedule => "0 57 17 * * *"
  }
}

filter {
  mutate {
    split => { "tags" => "," }
  }
}

output {
  elasticsearch {
    document_id => "%{id}"
    document_type => "_doc"
    index => "news"
    hosts => ["http://localhost:9200"]
  }
  stdout {
    codec => rubydebug
  }
}

4. Search case study

4.1 Custom analyzer

PUT news
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_standard_pinyin": {
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["my_pinyin"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}

4.2 Define the mappings

PUT news/_mapping
{
  "dynamic": false,
  "properties": {
    "id": {
      "type": "long"
    },
    "title": {
      "type": "text",
      "analyzer": "hanlp_standard"
    },
    "content": {
      "type": "text",
      "analyzer": "hanlp_standard"
    },
    "tags": {
      "type": "completion",
      "analyzer": "hanlp_standard",
      "fields": {
        "tag_pinyin": {
          "type": "completion",
          "analyzer": "hanlp_standard_pinyin"
        }
      }
    }
  }
}

4.3 Import the MySQL data

D:\logstash-datas\bin>logstash.bat -f ../config/logstash-mysql.conf

The Logstash configuration is the one from chapter 3; the database script is news.sql.
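
Once the schedule has fired and the import is done, a simple sanity check (not part of the original steps) is to count the documents in the news index:

GET news/_count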

Appendix:

  1. When defining mappings you can set "dynamic": false. Fields that are not declared in the mapping are still kept in the document's _source when data is imported, but they are not indexed, so they cannot be searched.
  2. When using a suggester, "skip_duplicates": true means that duplicate suggestions are collapsed so only one is returned, as shown below.
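
As an illustration of the second point, a completion suggester against the tags.tag_pinyin field defined in section 4.2 might look like the sketch below; the prefix value is a made-up example and the returned suggestions depend on the data imported from news.sql:

GET news/_search
{
  "suggest": {
    "tag_suggest": {
      "prefix": "jk",
      "completion": {
        "field": "tags.tag_pinyin",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}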