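Distributed Search Engine: ElasticSearch (Part 2)
About the author
The author is a third-year undergraduate from Heyuan. These notes record what I have picked up while teaching myself. If you spot a mistake, please point it out. I will keep improving the notes in the hope of helping more Java enthusiasts get started.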
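Contents
- About the author
- Distributed Search Engine: ElasticSearch (Part 2)
  - What is ElasticSearch
    - Pagination
    - Field highlighting (highlight)
      - Imitating Baidu search highlighting
    - bool query (for multi-condition queries)
    - Filters and range conditions (filter range)
    - Viewing information for all es indices
  - The elasticsearch Java API
    - Preparation
    - Index operations
      - Creating an index
      - Deleting an index
      - Checking whether an index exists
    - Document operations
      - Creating a document with a given id
      - Deleting a document with a given id
      - Updating a document with a given id
      - Getting a document with a given id
      - Search (match everything, match_all)
      - Search (full-text match)
      - Search (multi-field search, multi_match)
      - Search (selecting fields, fetchSource)
      - Pagination, sorting, and field highlighting
      - Boolean search (bool)
  - es in practice (JD product search)
    - Crawling data from JD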
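- Note: the ElasticSearch version used here is 7.6.1.
What is ElasticSearch
ElasticSearch is a search server built on Lucene. It provides a distributed, multi-tenant-capable full-text search engine behind a RESTful web interface. Elasticsearch is written in Java and released as open source under the Apache license, and it is one of the most popular enterprise search engines today. It is designed for cloud environments and offers near-real-time search that is stable, reliable, fast, and easy to install and use.
Suppose we are building a website or application and want to add search to it; building search from scratch is very hard. We want the search solution to be fast, to need zero configuration, and to be completely free. We want to index data simply by sending JSON over HTTP. We want the search server to be always available, to start on a single machine and scale to hundreds, to search in real time, to support simple multi-tenancy, and to work as a cloud solution. Elasticsearch is used to solve all of these problems and more that may come up. (Excerpted from Baidu Baike.)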
Pagination
"from" is the offset of the first hit to return and "size" is the number of hits per page.
GET goods/_search
{
  "query": { "match_all": {} },
  "sort": [ { "od": { "order": "desc" } } ],
  "from": 0,
  "size": 2
}
Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 4, "relation": "eq" },
    "max_score": null,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "4", "_score": null,
        "_source": { "title": "IQOONEO5", "content": "IQOONEO5 高通骁龙870Soc ,", "price": "2499", "od": 4 },
        "sort": [ 4 ]
      },
      {
        "_index": "goods", "_type": "_doc", "_id": "3", "_score": null,
        "_source": { "title": "小米11", "content": "小米11 高通骁龙888Soc ,1亿像素", "price": "4500", "od": 3 },
        "sort": [ 3 ]
      }
    ]
  }
}
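A page-based UI usually works with a page number rather than a raw offset. The following is a minimal sketch of that conversion, assuming 1-based page numbers; the class and method names are made up for illustration. Keep in mind that, by default, es rejects requests where from + size exceeds 10000 (the index.max_result_window setting).

```java
final class Pagination {
    private Pagination() {}

    /** Converts a 1-based page number and a page size into the zero-based "from" offset. */
    static int from(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        return Math.multiplyExact(page - 1, size);
    }

    /** Number of pages needed to show totalHits results, rounding up. */
    static long totalPages(long totalHits, int size) {
        if (totalHits < 0 || size < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and size >= 1");
        }
        return (totalHits + size - 1) / size;
    }

    public static void main(String[] args) {
        // Page 2 with 2 hits per page starts at offset 2.
        System.out.println("from = " + from(2, 2));
        // The goods index above reports 4 hits in total.
        System.out.println("pages = " + totalPages(4, 2));
    }
}
```

The value returned by from(...) is what would go into the "from" field of the request above.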
Field highlighting (highlight)
You can highlight one field or several. When a selected field matches the query, the matching terms are wrapped in <em> tags by default.
GET goods/_search
{
  "query": { "match": { "title": "华为P40" } },
  "highlight": {
    "fields": { "title": {} }
  }
}
Result:
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 2.7309713,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "1", "_score": 2.7309713,
        "_source": { "title": "华为P40", "content": "华为P40 8+256G,麒麟990Soc,贼牛逼", "price": "4999", "od": 1 },
        "highlight": { "title": [ "华为P40" ] }
      },
      {
        "_index": "goods", "_type": "_doc", "_id": "2", "_score": 1.5241971,
        "_source": { "title": "华为Mate30", "content": "华为Mate30 8+128G,麒麟990Soc", "price": "3998", "od": 2 },
        "highlight": { "title": [ "华为Mate30" ] }
      }
    ]
  }
}
The default tag is <em>, but pre_tags and post_tags let you replace the opening and closing markup with your own, for example an HTML tag styled with CSS.
GET goods/_search
{
  "query": { "match": { "title": "华为P40" } },
  "highlight": {
    "pre_tags": "",
    "post_tags": "",
    "fields": { "title": {} }
  }
}
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 2.7309713,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "1", "_score": 2.7309713,
        "_source": { "title": "华为P40", "content": "华为P40 8+256G,麒麟990Soc,贼牛逼", "price": "4999", "od": 1 },
        "highlight": { "title": [ "华为P40" ] }
      },
      {
        "_index": "goods", "_type": "_doc", "_id": "2", "_score": 1.5241971,
        "_source": { "title": "华为Mate30", "content": "华为Mate30 8+128G,麒麟990Soc", "price": "3998", "od": 2 },
        "highlight": { "title": [ "华为Mate30" ] }
      }
    ]
  }
}
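When the tags come from application code, the request body has to be assembled with the tag strings properly escaped. Below is a rough sketch using plain string concatenation. The class name, the helper names, and the <span class='hl'> tag are hypothetical, and a real project would more likely use a JSON library or the Java client.

```java
final class HighlightRequestBody {
    private HighlightRequestBody() {}

    /** Escapes backslashes and double quotes only; control characters are not handled in this sketch. */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a match query on one field with custom highlight tags for that same field. */
    static String build(String field, String text, String preTag, String postTag) {
        return "{"
                + "\"query\":{\"match\":{\"" + jsonEscape(field) + "\":\"" + jsonEscape(text) + "\"}},"
                + "\"highlight\":{"
                + "\"pre_tags\":\"" + jsonEscape(preTag) + "\","
                + "\"post_tags\":\"" + jsonEscape(postTag) + "\","
                + "\"fields\":{\"" + jsonEscape(field) + "\":{}}"
                + "}}";
    }

    public static void main(String[] args) {
        // Hypothetical tag: any opening/closing markup can be used here.
        System.out.println(build("title", "P40", "<span class='hl'>", "</span>"));
    }
}
```

The resulting string can be sent as the body of GET goods/_search.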
Imitating Baidu search highlighting
When you search Baidu for 华为P40, the match is highlighted not only in the title but also in the content. We can get the same effect by combining multi_match with highlight.
GET goods/_search
{
  "query": {
    "multi_match": {
      "query": "华为P40",
      "fields": [ "title", "content" ]
    }
  },
  "highlight": {
    "pre_tags": "",
    "post_tags": "",
    "fields": {
      "title": {},
      "content": {}
    }
  }
}
Result:
{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 2.8157697,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "1", "_score": 2.8157697,
        "_source": { "title": "华为P40", "content": "华为P40 8+256G,麒麟990Soc,贼牛逼", "price": "4999", "od": 1 },
        "highlight": {
          "title": [ "华为P40" ],
          "content": [ "华为P40 8+256G,麒麟990Soc,贼牛逼" ]
        }
      },
      {
        "_index": "goods", "_type": "_doc", "_id": "2", "_score": 1.8023796,
        "_source": { "title": "华为Mate30", "content": "华为Mate30 8+128G,麒麟990Soc", "price": "3998", "od": 2 },
        "highlight": {
          "title": [ "华为Mate30" ],
          "content": [ "华为Mate30 8+128G,麒麟990Soc" ]
        }
      }
    ]
  }
}
bool query (for multi-condition queries)
This is similar to AND / OR in MySQL.
Key point: must behaves like AND, and should behaves like OR (when a bool query contains only should clauses, at least one of them has to match).
Using must (AND):
Below we put two conditions inside must. Because it is must, a document has to satisfy both of them.
GET goods/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "华为" } },
        { "match": { "content": "MATE30" } }
      ]
    }
  }
}
Result:
{
  "took": 10,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 2.9512205,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "2", "_score": 2.9512205,
        "_source": { "title": "华为Mate30", "content": "华为Mate30 8+128G,麒麟990Soc", "price": "3998", "od": 2 }
      }
    ]
  }
}
Using should (OR):
should also contains two conditions here, but a document only needs to satisfy one of them.
GET goods/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "华为" } },
        { "match": { "content": "MATE30" } }
      ]
    }
  }
}
Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 2.9512205,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "2", "_score": 2.9512205,
        "_source": { "title": "华为Mate30", "content": "华为Mate30 8+128G,麒麟990Soc", "price": "3998", "od": 2 }
      },
      {
        "_index": "goods", "_type": "_doc", "_id": "1", "_score": 1.5241971,
        "_source": { "title": "华为P40", "content": "华为P40 8+256G,麒麟990Soc,贼牛逼", "price": "4999", "od": 1 }
      }
    ]
  }
}
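For readers who think in Java rather than SQL, the same AND / OR distinction can be illustrated in memory with Predicate.and and Predicate.or. This is only an analogy and not how es evaluates a query: the sample data is hypothetical, plain substring checks stand in for analyzed match queries, and scoring (which decides the order of real hits) is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class BoolAnalogy {
    private BoolAnalogy() {}

    /** A tiny stand-in for a goods document with only the two fields the conditions use. */
    static final class Doc {
        final String title;
        final String content;

        Doc(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    // Hypothetical sample data, loosely modelled on the goods index.
    static final List<Doc> DOCS = Arrays.asList(
            new Doc("Huawei P40", "Huawei P40 8+256G"),
            new Doc("Huawei Mate30", "Huawei Mate30 8+128G"),
            new Doc("Xiaomi 11", "Xiaomi 11 Snapdragon 888"));

    static final Predicate<Doc> TITLE_HAS_HUAWEI = d -> d.title.contains("Huawei");
    static final Predicate<Doc> CONTENT_HAS_MATE30 = d -> d.content.contains("Mate30");

    /** Returns the titles of all sample documents that satisfy the condition, in list order. */
    static List<String> titlesMatching(Predicate<Doc> condition) {
        List<String> titles = new ArrayList<>();
        for (Doc d : DOCS) {
            if (condition.test(d)) {
                titles.add(d.title);
            }
        }
        return titles;
    }

    /** must: both conditions have to hold (AND). */
    static List<String> must() {
        return titlesMatching(TITLE_HAS_HUAWEI.and(CONTENT_HAS_MATE30));
    }

    /** should: at least one condition has to hold (OR). */
    static List<String> should() {
        return titlesMatching(TITLE_HAS_HUAWEI.or(CONTENT_HAS_MATE30));
    }

    public static void main(String[] args) {
        System.out.println("must   -> " + must());
        System.out.println("should -> " + should());
    }
}
```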
Filters and range conditions (filter range)
Suppose we search by title and additionally require price > 4000. Put a range clause inside filter: a filter clause narrows the result set without contributing to the score.
GET goods/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "小米" } }
      ],
      "filter": {
        "range": {
          "price": { "gt": 4000 }
        }
      }
    }
  }
}
Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 2.4135482,
    "hits": [
      {
        "_index": "goods", "_type": "_doc", "_id": "3", "_score": 2.4135482,
        "_source": { "title": "小米11", "content": "小米11 高通骁龙888Soc ,1亿像素", "price": "4500", "od": 3 }
      }
    ]
  }
}
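One thing to watch: in the sample documents, price appears as a JSON string ("4500"). The index mapping is not shown here, so this is an assumption, but if price ended up mapped as text or keyword rather than a numeric type, a range clause would compare values as strings, not as numbers. The sketch below shows the difference between the two orderings in plain Java; the class and method names are invented for the illustration.

```java
import java.math.BigDecimal;

final class PriceComparison {
    private PriceComparison() {}

    /** Lexicographic comparison, which is how string terms are ordered. */
    static boolean greaterAsText(String a, String b) {
        return a.compareTo(b) > 0;
    }

    /** Numeric comparison, which is what a price filter is normally meant to do. */
    static boolean greaterAsNumber(String a, String b) {
        return new BigDecimal(a).compareTo(new BigDecimal(b)) > 0;
    }

    public static void main(String[] args) {
        // "4500" vs "4000": both orderings agree, so the query above looks correct.
        System.out.println(greaterAsText("4500", "4000") + " / " + greaterAsNumber("4500", "4000"));
        // "999" vs "4000": the orderings disagree, because '9' sorts after '4'.
        System.out.println(greaterAsText("999", "4000") + " / " + greaterAsNumber("999", "4000"));
    }
}
```

If numeric behaviour is what you want, mapping price as a numeric type when the index is created avoids the problem.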
Viewing information for all es indices
GET _cat/indices?v
The elasticsearch Java API
Preparation
1. Add the elasticsearch high-level REST client and the elasticsearch dependency. Their version must match the es server you are running; ours is 7.6.1. fastjson is also added, and is used later to serialize documents to JSON.
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.6.1</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>7.6.1</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>
2. Look at the RestHighLevelClient constructor:
public RestHighLevelClient(RestClientBuilder restClientBuilder) {
    this(restClientBuilder, Collections.emptyList());
}
It takes a RestClientBuilder. We do not construct that object ourselves; we obtain it from the static RestClient.builder(...) method.
3. Look at RestClient:
public static RestClientBuilder builder(HttpHost... hosts) {
    if (hosts == null || hosts.length == 0) {
        throw new IllegalArgumentException("hosts must not be null nor empty");
    }
    List<Node> nodes = Arrays.stream(hosts).map(Node::new).collect(Collectors.toList());
    return new RestClientBuilder(nodes);
}
RestClient.builder returns a RestClientBuilder and takes one or more HttpHost arguments, so next we look at HttpHost:
// Arguments: host name of the es server, es port, scheme (defaults to http)
public HttpHost(String hostname, int port, String scheme) {
    this.hostname = (String) Args.containsNoBlanks(hostname, "Host name");
    this.lcHostname = hostname.toLowerCase(Locale.ROOT);
    if (scheme != null) {
        this.schemeName = scheme.toLowerCase(Locale.ROOT);
    } else {
        this.schemeName = "http";
    }
    this.port = port;
    this.address = null;
}
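4. Putting it together, the client is configured like this:
HttpHost httpHost = new HttpHost("localhost", 9200, "http");
RestClientBuilder restClientBuilder = RestClient.builder(httpHost);
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);
5. For convenience, we can hand the RestHighLevelClient to the Spring IoC container and have it injected wherever it is needed:
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class esConfig {

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        HttpHost httpHost = new HttpHost("localhost", 9200, "http");
        RestClientBuilder builder = RestClient.builder(httpHost);
        return new RestHighLevelClient(builder);
    }
}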
Index operations
With the Java API, every index operation goes through restHighLevelClient.indices().xxx().
Creating an index
// Create an index
// (CreateIndexRequest is org.elasticsearch.client.indices.CreateIndexRequest)
@Test
public void createIndex() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    // Build a create-index request with the name of the index to create
    CreateIndexRequest createIndexRequest = new CreateIndexRequest("java01");
    // Send the create-index request to es
    CreateIndexResponse createIndexResponse = restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
    System.out.println(createIndexResponse.isAcknowledged());
    restHighLevelClient.close();
}
Deleting an index
// Delete an index
@Test
public void deleteIndex() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    // Build a delete-index request with the name of the index to delete
    DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest("java01");
    // Send the delete-index request to es
    restHighLevelClient.indices().delete(deleteIndexRequest, RequestOptions.DEFAULT);
    restHighLevelClient.close();
}
Checking whether an index exists
// Check whether an index exists
// (GetIndexRequest is org.elasticsearch.client.indices.GetIndexRequest)
@Test
public void indexExists() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    GetIndexRequest getIndexRequest = new GetIndexRequest("goods");
    boolean exists = restHighLevelClient.indices().exists(getIndexRequest, RequestOptions.DEFAULT);
    System.out.println(exists);
    restHighLevelClient.close();
}
Document operations
Creating a document with a given id
// Create a document
@Test
public void createIndexDoc() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    IndexRequest indexRequest = new IndexRequest("hello");
    // Set the document id
    indexRequest.id("1");
    /*
     * IndexRequest.source(...) has several overloads, and any of them works. One of them takes a Map:
     *
     * public IndexRequest source(Map<String, ?> source, XContentType contentType) throws ElasticsearchGenerationException {
     *     try {
     *         XContentBuilder builder = XContentFactory.contentBuilder(contentType);
     *         builder.map(source);
     *         return this.source(builder);
     *     } catch (IOException var4) {
     *         throw new ElasticsearchGenerationException("Failed to generate [" + source + "]", var4);
     *     }
     * }
     *
     * Here we collect the key/value pairs in a Map and pass them on as a JSON string.
     */
    Map<String, Object> source = new HashMap<>();
    source.put("a_age", "50");
    source.put("a_address", "广州");
    // es documents are JSON, so we serialize the Map with fastjson and declare the content type as JSON
    indexRequest.source(JSON.toJSONString(source), XContentType.JSON);
    IndexResponse response = restHighLevelClient.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println("response:" + response);
    System.out.println("status:" + response.status());
    restHighLevelClient.close();
}
Deleting a document with a given id
// Delete a document
@Test
public void deleteDoc() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    DeleteRequest deleteRequest = new DeleteRequest("hello");
    deleteRequest.id("1");
    DeleteResponse delete = restHighLevelClient.delete(deleteRequest, RequestOptions.DEFAULT);
    System.out.println(delete.status());
    restHighLevelClient.close();
}
Updating a document with a given id
// Update a document
@Test
public void updateDoc() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    /*
     * We use the following constructor:
     *
     * public UpdateRequest(String index, String id) {
     *     super(index);
     *     this.refreshPolicy = RefreshPolicy.NONE;
     *     this.waitForActiveShards = ActiveShardCount.DEFAULT;
     *     this.scriptedUpsert = false;
     *     this.docAsUpsert = false;
     *     this.detectNoop = true;
     *     this.id = id;
     * }
     */
    UpdateRequest updateRequest = new UpdateRequest("hello", "1");
    // Only the fields present in the partial document are changed
    Map<String, Object> source = new HashMap<>();
    source.put("a_address", "河源");
    updateRequest.doc(JSON.toJSONString(source), XContentType.JSON);
    UpdateResponse response = restHighLevelClient.update(updateRequest, RequestOptions.DEFAULT);
    System.out.println(response.status());
    restHighLevelClient.close();
}
Getting a document with a given id
// Get a document
@Test
public void getDoc() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    GetRequest getRequest = new GetRequest("hello");
    getRequest.id("1");
    GetResponse response = restHighLevelClient.get(getRequest, RequestOptions.DEFAULT);
    String sourceAsString = response.getSourceAsString();
    System.out.println(sourceAsString);
    restHighLevelClient.close();
}
Search (match everything, match_all)
// Search (match everything, match_all)
@Test
public void search_matchAll() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    /*
     * public SearchRequest(String... indices) {
     *     this(indices, new SearchSourceBuilder());
     * }
     */
    SearchRequest searchRequest = new SearchRequest("hello");
    // SearchSourceBuilder corresponds to the body of the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    MatchAllQueryBuilder matchAllQueryBuilder = QueryBuilders.matchAllQuery();
    // Equivalent to the "query" part of _search
    searchSourceBuilder.query(matchAllQueryBuilder);
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        System.out.println(hit.getSourceAsString());
    }
    restHighLevelClient.close();
}
Search (full-text match)
A match query analyzes the search text and matches documents that contain any of the resulting terms; it is not the same thing as the separate fuzzy query.
// Full-text search with match
@Test
public void search_match() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    // Name the index explicitly; a SearchRequest with no index searches every index
    SearchRequest searchRequest = new SearchRequest("hello");
    // Build the search body
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("a_address", "广州");
    searchSourceBuilder.query(matchQueryBuilder);
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        System.out.println(hit.getSourceAsString());
    }
    restHighLevelClient.close();
}
Search (multi-field search, multi_match)
// Search (multi-field search, multi_match)
@Test
public void search_multiMatch() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    SearchRequest searchRequest = new SearchRequest("goods");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // First argument is the search text, the remaining arguments are the fields to search
    searchSourceBuilder.query(QueryBuilders.multiMatchQuery("华为", "title", "content"));
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        System.out.println(hit.getSourceAsString());
    }
    restHighLevelClient.close();
}
Search (selecting fields, fetchSource)
The fetchSource method corresponds to _source in the query DSL: it controls which fields of each document are returned.
// Select returned fields with fetchSource (_source)
@Test
public void search_source() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    SearchRequest searchRequest = new SearchRequest("goods");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.matchAllQuery());
    /*
     * public SearchSourceBuilder fetchSource(@Nullable String[] includes, @Nullable String[] excludes) {
     *     FetchSourceContext fetchSourceContext = this.fetchSourceContext != null ? this.fetchSourceContext : FetchSourceContext.FETCH_SOURCE;
     *     this.fetchSourceContext = new FetchSourceContext(fetchSourceContext.fetchSource(), includes, excludes);
     *     return this;
     * }
     */
    String[] includes = {"title"}; // fields to include
    String[] excludes = {};        // fields to exclude
    searchSourceBuilder.fetchSource(includes, excludes);
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        System.out.println(hit.getSourceAsString());
    }
    restHighLevelClient.close();
}
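To make the effect of includes and excludes concrete, here is a simplified in-memory model of the filtering. It is an illustration under stated assumptions, not the es implementation: it only handles exact top-level field names (es also accepts wildcard patterns), and the class name and sample document are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class SourceFilter {
    private SourceFilter() {}

    /**
     * Keeps a field when it is listed in includes (or includes is empty)
     * and it is not listed in excludes. Exact names only.
     */
    static Map<String, Object> apply(Map<String, Object> source, String[] includes, String[] excludes) {
        Set<String> included = new HashSet<>(Arrays.asList(includes));
        Set<String> excluded = new HashSet<>(Arrays.asList(excludes));
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            boolean wanted = included.isEmpty() || included.contains(entry.getKey());
            if (wanted && !excluded.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    /** Hypothetical document shaped like the ones in the goods index. */
    static Map<String, Object> sample() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "Mate30");
        doc.put("content", "Mate30 8+128G");
        doc.put("price", "3998");
        doc.put("od", 2);
        return doc;
    }

    public static void main(String[] args) {
        // Same arguments as the test method above: include only "title".
        System.out.println(apply(sample(), new String[] {"title"}, new String[] {}));
        // Leave includes empty and drop a single field instead.
        System.out.println(apply(sample(), new String[] {}, new String[] {"content"}));
    }
}
```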
Pagination, sorting, and field highlighting
We want to translate the following es console request into Java code:
GET goods/_search
{
  "query": {
    "match": { "title": "华为" }
  },
  "sort": [ { "od": { "order": "desc" } } ],
  "from": 0,
  "size": 1,
  "highlight": {
    "pre_tags": "",
    "post_tags": "",
    "fields": { "title": {} }
  }
}
Java implementation:
// Pagination, sorting, and field highlighting
@Test
public void page_sort_HighLight() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    SearchRequest searchRequest = new SearchRequest("goods");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("title", "华为");
    searchSourceBuilder.query(matchQueryBuilder);
    // Pagination
    searchSourceBuilder.from(0);
    searchSourceBuilder.size(1);
    // Sorting
    searchSourceBuilder.sort("od", SortOrder.DESC);
    // ===== highlighting: start =====
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    // Opening and closing tags (pre_tags and post_tags)
    highlightBuilder.preTags("");
    highlightBuilder.postTags("");
    // We use the String overload of highlightBuilder.field():
    /*
     * public HighlightBuilder field(String name) {
     *     return this.field(new HighlightBuilder.Field(name));
     * }
     */
    highlightBuilder.field("title");
    // To highlight more fields, call field() once per field:
    // highlightBuilder.field(); // second highlighted field
    // highlightBuilder.field(); // third highlighted field, and so on
    searchSourceBuilder.highlighter(highlightBuilder);
    // ===== highlighting: end =====
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    // hits holds every document that matched
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        System.out.println("highlightMap:" + highlightFields);
        // Look up the fragments by the key "title".
        // The fragments hold the highlighted field text (important: it can replace the plain field value), e.g. 华为Mate30
        HighlightField titleHighlight = highlightFields.get("title");
        if (titleHighlight != null) {
            System.out.println("fragments:" + Arrays.toString(titleHighlight.getFragments()));
        }
    }
    restHighLevelClient.close();
}
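The comment in the loop says the fragments can replace the plain field value. A rough sketch of that overlay step is shown below, using plain collections so that it runs without an es server. The class name is hypothetical, the highlight map stands in for what getHighlightFields() would provide, and joining fragments is only reasonable for short fields such as a title, where the highlighter typically returns the whole value as one fragment.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HighlightOverlay {
    private HighlightOverlay() {}

    /**
     * Returns a copy of the source document in which every field that has
     * highlight fragments is replaced by the joined fragments.
     * The original map is left untouched.
     */
    static Map<String, Object> overlay(Map<String, Object> source, Map<String, List<String>> highlights) {
        Map<String, Object> merged = new LinkedHashMap<>(source);
        for (Map.Entry<String, List<String>> entry : highlights.entrySet()) {
            List<String> fragments = entry.getValue();
            // Only overwrite fields that exist in the source and actually have fragments.
            if (merged.containsKey(entry.getKey()) && fragments != null && !fragments.isEmpty()) {
                merged.put(entry.getKey(), String.join("", fragments));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("title", "Mate30");
        source.put("od", 2);
        Map<String, List<String>> highlights = new LinkedHashMap<>();
        highlights.put("title", java.util.Arrays.asList("<em>Mate30</em>"));
        System.out.println(overlay(source, highlights));
    }
}
```

The merged map is what a controller would typically hand to the page template, so the view does not need to know about highlighting at all.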
Boolean search (bool)
We want to implement an es request like the following:
GET goods/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "title": { "value": "华" } } },
        { "term": { "title": { "value": "米" } } }
      ]
    }
  }
}
Java implementation:
// Boolean search (bool)
@Test
public void search_bool() throws IOException {
    RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
    SearchRequest searchRequest = new SearchRequest("goods");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the bool query object with QueryBuilders
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Each should() call adds exactly one clause, so when "should" holds
    // several conditions we call should() once per condition:
    /*
     * "should": [
     *   { "term": { "title": { "value": "华" } } },
     *   { "term": { "title": { "value": "米" } } }
     * ]
     */
    // The should array above has two conditions, so we call should() twice
    boolQueryBuilder.should(QueryBuilders.termQuery("title", "华"));
    boolQueryBuilder.should(QueryBuilders.termQuery("title", "米"));
    searchSourceBuilder.query(boolQueryBuilder);
    searchRequest.source(searchSourceBuilder);
    SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    SearchHit[] hits = search.getHits().getHits();
    for (SearchHit hit : hits) {
        System.out.println(hit.getSourceAsString());
    }
    restHighLevelClient.close();
}
es in practice (JD product search)
Crawling data from JD
1. Add the dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
2. Create the entity class:
public class goods {

    private String img;   // product image
    private String price; // product price
    private String title; // product title

    public goods() {
    }

    public goods(String img, String price, String title) {
        this.img = img;
        this.price = price;
        this.title = title;
    }

    public String getImg() {
        return img;
    }

    public void setImg(String img) {
        this.img = img;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    @Override
    public String toString() {
        return "goods{" +
                "img='" + img + '\'' +
                ", price='" + price + '\'' +
                ", title='" + title + '\'' +
                '}';
    }
}
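3. Use jsoup to fetch and parse the JD search page (the core step) and wrap it in a utility class:
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

@Component
public class jsoupUtils {

    private static RestHighLevelClient restHighLevelClient;

    // Setter injection, so that Spring can populate the static field
    @Autowired
    public void setRestHighLevelClient(RestHighLevelClient restHighLevelClient) {
        jsoupUtils.restHighLevelClient = restHighLevelClient;
    }

    /**
     * Runs a JD search for the keyword and adds the scraped products to es.
     */
    public static void searchData_JD(String keyword) {
        BulkRequest bulkRequest = new BulkRequest();
        try {
            // URL-encode the keyword so Chinese characters and spaces are safe in the query string
            URL url = new URL("https://search.jd.com/Search?keyword=" + URLEncoder.encode(keyword, "UTF-8"));
            // Let jsoup fetch and parse the page (30-second timeout)
            Document document = Jsoup.parse(url, 30000);
            Element e1 = document.getElementById("J_goodsList");
            if (e1 == null) {
                System.out.println("J_goodsList not found in the page");
                return;
            }
            Elements e_lis = e1.getElementsByTag("li");
            for (Element e_li : e_lis) {
                // Several prices may be present (for example bundle prices); we take the first element
                Elements e_price = e_li.getElementsByClass("p-price");
                String text = e_price.get(0).text();
                // The text may hold both the regular price and the JD PLUS member price, so we cut it.
                // The first character is assumed to be the ¥ sign; scan from index 1 and stop at the next ¥.
                String realPrice = "¥";
                for (int i = 1; i < text.length(); i++) {
                    if (text.charAt(i) == '¥') {
                        break;
                    } else {
                        realPrice += text.charAt(i);
                    }
                }
                // Product image
                Elements e_img = e_li.getElementsByClass("p-img");
                Elements img = e_img.get(0).getElementsByTag("img");
                // JD does not put the image URL in src but in the lazy-loading attribute data-lazy-img
                String src = img.get(0).attr("data-lazy-img");
                System.out.println("http:" + src);
                // Price
                System.out.println(realPrice);
                // Product title
                Elements e_title = e_li.getElementsByClass("p-name");
                String title = e_title.get(0).getElementsByTag("em").text();
                System.out.println(title);
                IndexRequest indexRequest = new IndexRequest("jd_goods");
                // Fill in the document
                Map<String, Object> good = new HashMap<>();
                good.put("img", "http:" + src);
                good.put("price", realPrice);
                good.put("title", title);
                IndexRequest source = indexRequest.source(JSON.toJSONString(good), XContentType.JSON);
                bulkRequest.add(source);
            }
            // Bulk operation: fewer round trips to the es server.
            // An empty bulk request fails validation, so only send it when something was added.
            if (bulkRequest.numberOfActions() > 0) {
                restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}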
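The price-cutting loop in the utility class keeps everything up to the second ¥ sign. The same idea can be expressed with a regular expression, which also copes with text that does not start with the currency sign. The sketch below is one possible alternative rather than the approach used above; the class name is made up, and it assumes prices are written as a ¥ sign (half-width U+00A5 or full-width U+FFE5) followed by digits with an optional decimal part.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceText {
    private PriceText() {}

    // A yen/yuan sign (U+00A5 or full-width U+FFE5), optional whitespace,
    // then digits with an optional decimal part.
    private static final Pattern PRICE = Pattern.compile("[\\u00A5\\uFFE5]\\s*\\d+(?:\\.\\d+)?");

    /** Returns the first price found in the text, including its currency sign. */
    static Optional<String> firstPrice(String text) {
        Matcher matcher = PRICE.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Regular price followed directly by a member price, as in the scraped text.
        System.out.println(firstPrice("\uFFE52499.00\uFFE52399.00").orElse("n/a"));
        // Text without any price.
        System.out.println(firstPrice("sold out").orElse("n/a"));
    }
}
```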
4. Use the utility class:
public static void main(String[] args) {
    SpringApplication.run(DemoApplication.class, args);
    jsoupUtils.searchData_JD("vivo");
}
Now that the data is in es, we can use it to render a results page.
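The keyword "vivo" is plain ASCII, but product searches are often in Chinese or contain spaces, and such keywords have to be percent-encoded before they are placed in a URL query string. Here is a small sketch of building the search URL that way; the class and method names are hypothetical, and only the base URL comes from the utility class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class JdSearchUrl {
    private JdSearchUrl() {}

    private static final String BASE = "https://search.jd.com/Search?keyword=";

    /** Builds the search URL with the keyword encoded as UTF-8 form data. */
    static String forKeyword(String keyword) {
        try {
            // URLEncoder uses application/x-www-form-urlencoded rules, so a space becomes '+'.
            return BASE + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forKeyword("vivo"));
        // "\u534E\u4E3A" is the two-character keyword used in the search examples (Huawei).
        System.out.println(forKeyword("\u534E\u4E3A"));
    }
}
```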