【项目日记（二）】搜索引擎-索引制作_索引编写

技术文档

❣博主主页: 33的博客❣
▶️文章专栏分类:项目日记◀️
🚚我的代码仓库: 33的代码仓库🚚
🫵🫵🫵关注我带你了解更多项目内容

在这里插入图片描述

1.前言

在上一篇文章中，我们已经介绍了索引解析，那么接下来我们继续完善我们的项目，既然已经有了解析好的索引，那么我们就需要把解析的内容添加到倒排索引和正排索引中。

2.索引结构

创建index类，通过这个类来构建索引结构
基本步骤：

用ArrayList创建正排索引
用HashMap创建倒排索引
1.给定docid在正排索引中，查询详细信息
2.给定一个词，在倒排索引中查与这个词的关联文档
3.往索引中新增文档
4.把内存的索引保存到磁盘
5.把磁盘的索引结构保存到内存

2.1创捷索引

正排索引

private ArrayList<DocInfo> forwardIndex=new ArrayList<>();

倒排索引

private HashMap<String,ArrayList<Weight>> invertedIndex=new HashMap<>();

DocInfo类：

public class DocInfo { private int docID; private String title; private String url; private String content; public int getDocID() { return docID; } public void setDocID(int docID) { this.docID = docID; } public String getTitle() { return title; } public void setTitle(String title) { this.title = title; } public String getUrl() { return url; } public void setUrl(String url) { this.url = url; } public String getContent() { return content; } public void setContent(String content) { this.content = content; }}

Weight类：

public class Weight { private int docId; private int weight; public int getDocId() { return docId; } public void setDocId(int docId) { this.docId = docId; } public int getWeight() { return weight; } public void setWeight(int weight) { this.weight = weight; }}

2.2根据索引查询

//1.根据docId查询文档详情，数组小标就是文档id public DocInfo getDocInfo(int docId){ return forwardIndex.get(docId); }//2.给定一个词，查在哪些文档中 public List<Weight> getInverted(String term){ return invertedIndex.get(term); }

2.3新增文档

 public void addDoc(String title,String url,String content){ //给正排索引新增和倒排索引都新增信息 //构建正排索引 DocInfo docInfo=buildForward(title,url,content); //创建倒排索引 buildInverted(docInfo); }

在正排索引中添加文档：

private DocInfo buildForward(String title, String url, String content) { DocInfo docInfo=new DocInfo(); docInfo.setTitle(title); docInfo.setUrl(url); docInfo.setContent(content); //巧妙设计：docInfoId的下标和数组下标一一对应 docInfo.setDocID(forwardIndex.size()); forwardIndex.add(docInfo); return docInfo; }

在倒排索引中新增文档
1.需要统计每一个词在文档中的出现次数，在根据次数算出权重
2.首先进行分词操作，统计每一个不同的词在标题中出现的次数
3.再进行分词操作，统计每一个词在正文出现的次数
4.设置权重为标题次数*10+正文次数

 private void buildInverted(DocInfo docInfo) { class WordCnt{ public int titleCount; public int contentCount; } HashMap<String,WordCnt> wordCntHashMap=new HashMap<>(); //1.针对标题进行分词操作 List<Term> terms= ToAnalysis.parse(docInfo.getTitle()).getTerms(); //2.针对分词结果，统计每个词出现的次数 for (Term term:terms){ String word=term.getName(); WordCnt wordCnt=wordCntHashMap.get(word); if (wordCnt==null){ WordCnt newwordCnt=new WordCnt(); newwordCnt.titleCount=1; newwordCnt.contentCount=0; wordCntHashMap.put(word,newwordCnt); }else { wordCnt.titleCount+=1; } } //3.针对正文进行分词操作 List<Term> terms2=ToAnalysis.parse(docInfo.getContent()).getTerms(); //4.遍历分词结果，统计每个词出现的次数 for (Term term:terms2){ String word=term.getName(); WordCnt wordCnt=wordCntHashMap.get(word); if (wordCnt==null){ WordCnt newWordCnt=new WordCnt(); newWordCnt.titleCount=0; newWordCnt.contentCount=1; wordCntHashMap.put(word,newWordCnt); }else { wordCnt.contentCount+=1; } } //5.设置权重为：标题*10+正文 //一个对象必须实现了Iterable接口才能使用for each进行遍历，而Map并没有实现该接口，但Set实现了，所以就把Map转换为Set for(Map.Entry<String,WordCnt> entry:wordCntHashMap.entrySet()) {  List<Weight> invertedList=invertedIndex.get(entry.getKey()); if (invertedList==null){  ArrayList<Weight> newInvertedList=new ArrayList<>();  Weight weight=new Weight();  weight.setWeight(entry.getValue().titleCount*10+entry.getValue().contentCount);  weight.setDocId(docInfo.getDocID());  newInvertedList.add(weight);  invertedIndex.put(entry.getKey(),newInvertedList); }else {  Weight weight=new Weight();  weight.setDocId(docInfo.getDocID());  weight.setWeight(entry.getValue().titleCount*10+entry.getValue().contentCount);  invertedList.add(weight); } } }

2.4内存索引保存到磁盘

索引当前是存储在内存中的，构造索引的过程是非常耗时的，因此我们就不应该再服务器启动时才去构造索引，通常就把这些耗时的操作，单独执行完成之后，然后再让线上的服务器加载构造好的索引。
我们就把内存中构造的索引结构，给变成一个字符串，然后写入文件即可，这个操作就叫序列化。适应Jackson中的ObjectMapper来完成此操作。

 private static String INDEX_PATH=\"D:/doc_searcher_index/\"; public void save(){ long beg=System.currentTimeMillis(); System.out.println(\"保存索引开始！\"); File indexPathFile=new File(INDEX_PATH); if(!indexPathFile.exists()){ indexPathFile.mkdir(); } File forwardIndexFile=new File(INDEX_PATH+\"forward.txt\"); File invertedIndexFile=new File(INDEX_PATH+\"inverted.txt\"); try { //利用ObjectMapperJava对象转换为JSON格式 //从内存中读取forwardIndex保存到forwardIndexFile objectMapper.writeValue(forwardIndexFile,forwardIndex); //从内存中读取nvertedIndex保存到invertedIndexFile forwardIndexFileobjectMapper.writeValue(invertedIndexFile,invertedIndex); }catch (IOException e) { e.printStackTrace(); } long end=System.currentTimeMillis(); System.out.println(\"保存索引完成！消耗时间：\"+(end-beg)+\"ms\"); }

2.5把磁盘索引加载到内存

public void load(){ long beg=System.currentTimeMillis(); System.out.println(\"加载索引开始！\"); File forwardIndexFile=new File(INDEX_PATH+\"forward.txt\"); File invertedIndexFile=new File(INDEX_PATH+\"inverted.txt\"); try { forwardIndex=objectMapper.readValue(forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {}); invertedIndex=objectMapper.readValue(invertedIndexFile, new TypeReference<HashMap<String, ArrayList<Weight>>>() {}); }catch (IOException e){ e.printStackTrace(); } long end=System.currentTimeMillis(); System.out.println(\"加载引擎结束消耗时间：\"+(end-beg)+\"ms\"); }

parser相当于制作索引的入口，对应到一个可执行的程序
index相当于实现了索引的数据结构，提供一些Api
接下来我们就在parser里面调用对应的api
在parser类中解析完Html文件时，应添加到索引中

private void parseHTML(File f) { //1.解析HTML标题 String title=parseTitle(f); //2.解析HTML的URL String url=parseUrl(f); //3.解析HTML的正文 long beg=System.nanoTime(); //String content=parseContent(f); String content=parseContentByRegex(f); long mid=System.nanoTime(); //把解析出来的信息加载到索引 index.addDoc(title,url,content); }

在添加完索引之后，应该把索引保存到磁盘

public void run() { long beg=System.currentTimeMillis(); System.out.println(\"索引制作开始\"); //1.枚举出INPUT_PATH下的所有html文件 ArrayList<File> fileList=new ArrayList<>(); enumFile(INPUT_PATH,fileList); //2.解析文档内容 for (File f:fileList){ System.out.println(\"开始解析\"+f.getAbsolutePath()+\"....\"); parseHTML(f); } //3.把内存构造的索引保存到磁盘 index.save(); long end=System.currentTimeMillis(); System.out.println(\"索引制作结束！消耗时间：\"+(end-beg)+\"ms\"); }

3.性能优化

此时我们已经完成了文档解析和索引制作模块，那么我们进行验证
【项目日记（二）】搜索引擎-索引制作_索引编写
文档内容正确生成：

但我们观察索引制作的时间：一个消耗了19973ms就是19s，花费的时间是比较长的，那么有什么办法提高效率呢？方法当然是有的，首先我们得清楚具体是哪一个步骤拖慢了执行效率，我们来分析代码：
【项目日记（二）】搜索引擎-索引制作_索引编写
可以看到解析文档的时候从磁盘读文件，循环遍历文件操作，那么显然效率是非常慢的，既然一个线程串行执行效率非常慢，那么我们就采用多线程并发执行来提高效率。

3.1多线程

我们可以使用创建一个线程池来实现并发操作。通过submit往线程池中提价任务，操作极快（只是把Runnable对象放入阻塞队列中）。
把代码改进成多线程的版本，线程池中的线程数目具体设置成多少才合适呢？最好通过实验来确定。

public void run() { long beg=System.currentTimeMillis(); System.out.println(\"索引制作开始\"); //1.枚举出INPUT_PATH下的所有html文件 ArrayList<File> fileList=new ArrayList<>(); enumFile(INPUT_PATH,fileList); ExecutorService executorService= Executors.newFixedThreadPool(6); //2.解析文档内容 for (File f:files){ executorService.submit(new Runnable() { @Override public void run() {  System.out.println(\"解析：\"+f.getAbsolutePath());  parseHTML(f);  } }); } //3.把内存构造的索引保存到磁盘 index.save(); long end=System.currentTimeMillis(); System.out.println(\"索引制作结束！消耗时间：\"+(end-beg)+\"ms\"); }

3.2线程安全

我们既然引入了多线程就要考虑线程安全问题，要注意修改操作和读写操作。当多个线程同时尝试修改同一个共享数据时，需要确保数据的一致性，避免出现竞态条件。读写操作：如果一个线程在读取共享数据的同时另一个线程在修改该数据，可能导致读取到不一致或无效的数据。
那么我们就需要对程序进行加锁操作：
【项目日记（二）】搜索引擎-索引制作_索引编写

3.3CountDownLatch类

添加锁虽然解决了线程安全问题，依然有新的问题，那就是在所有文件提交完成后就会立即执行save（）操作，但是可能文件解析还没有完成。为了解决这样的问题，我们就引入 CountDownLatch类。
CountDownLatch类类似于跑步比赛的裁判，只有所有的选手都撞线了，就认为这场比赛结束了。再构造 CountDownLatch的时候指定一下比赛选手的个数，每个选手撞线都要通知一下countDown（），通过await来等待所有的选手都撞线完毕才执行save（）操作。

public void runByThread(){ long beg=System.currentTimeMillis(); System.out.println(\"索引开始制作\"); //1.枚举出INPUT_PATH下的所有html文件 ArrayList<File> files=new ArrayList<>(); enumFile(INPUT_PATH,files); //2.解析文档内容 CountDownLatch latch=new CountDownLatch(files.size()); ExecutorService executorService= Executors.newFixedThreadPool(6); for (File f:files){ executorService.submit(new Runnable() { @Override public void run() {  System.out.println(\"解析：\"+f.getAbsolutePath());  parseHTML(f);  latch.countDown(); } }); } try { //await会阻塞，把所有选手都调用countDown以后才会继续执行 latch.await(); } catch (InterruptedException e) { throw new RuntimeException(e); } //手动的把线程池里面的线程杀掉 executorService.shutdown(); //3.把内存构造的索引保存到磁盘 index.save(); long end=System.currentTimeMillis(); System.out.println(\"索引制作结束！消耗时间：\"+(end-beg)+\"ms\"); System.out.println(\"t1:\"+t1+\"t2\"+t2); }

4.总结

这篇文章主要完成了索引制作模块，以及进行了性能优化，在下一篇文章中将进行搜索模块的制作。

下期预告：项目日记(三)

【项目日记（二）】搜索引擎-索引制作_索引编写

目录

1.前言

2.索引结构

2.1创捷索引

2.2根据索引查询

2.3新增文档

2.4内存索引保存到磁盘

2.5把磁盘索引加载到内存

3.性能优化

3.1多线程

3.2线程安全

3.3CountDownLatch类

4.总结

公告

DeepSeek全套部署资料免费下载

免费可商用字体批量下载

标签

【项目日记（二）】搜索引擎-索引制作_索引 编写

目录

1.前言

2.索引结构

2.1创捷索引

2.2根据索引查询

2.3新增文档

2.4内存索引保存到磁盘

2.5把磁盘索引加载到内存

3.性能优化

3.1多线程

3.2线程安全

3.3CountDownLatch类

4.总结

相关问题

公告

DeepSeek全套部署资料免费下载

免费可商用字体批量下载

标签

【项目日记（二）】搜索引擎-索引制作_索引编写