[Boost Search Engine Project] Building a Site Search Engine for Boost: Technical Practice and Exploration

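Technical document

1. Project background

2. High-level principles of a search engine

3. Tech stack and project environment

4. Forward index vs. inverted index: how the search engine works

5. Writing the Parser module: tag stripping and data cleaning

5.1. Stripping tags

Goal:

5.2. Overall code framework:

Implementing EnumFile:

EnumFile test result

How do we extract the page URL?

Testing: are the parsed title, content, and url correct?

6. Writing the index module

6.1. Basic framework of the index module:

6.2. Building the forward index:

How to use split():

6.3. Building the inverted index:

7. Writing the searcher module

7.1. Basic code framework:

7.2. Building the summary

Problem: duplicate documents in the search results

7.3. Revised code with deduplication:

7.4. Testing:

8. Writing the http_server module

9. A simple logging system

10. Front-end code

11. Final test results (finished product)

12. Summary of problems encountered:

Problem 1:

Problem 2:

Problem 3:

Problem 4:

Problem 5:

Problem 6:

Problem 7: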

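1. Project background

Companies such as Baidu, Sogou, 360 Search, and the Toutiao news client run web-scale search engines. Building one of those ourselves is not realistic.
Site search is different: the data being searched is more vertical, and the volume is much smaller.
The official Boost website has no site search, so we build one ourselves.

Boost official site: https://www.boost.org/

We use the HTML files under the boost_1_86_0/doc/html directory (the latest release when this was written) and build the index from them.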

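2. High-level principles of a search engine

3. Tech stack and project environment

Tech stack: C/C++ (C++11), STL, the Boost libraries, Jsoncpp, cppjieba, cpp-httplib

Optional: HTML5, CSS, JavaScript, jQuery, Ajax (front end)

Project environment: an Ubuntu cloud server (listed as 9.4.0, which looks like the GCC version string rather than an Ubuntu release number), vim/gcc(g++)/Makefile, VS2022 or VS Code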

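4. Forward index vs. inverted index: how the search engine works

Document 1: 雷军买了四斤小米 (Lei Jun bought four jin of millet)
Document 2: 雷军发布了小米手机 (Lei Jun released the Xiaomi phone)

Forward index: maps a document ID to the document's content (the keywords inside the document).

Tokenize the target documents (this makes it easier to build the inverted index and to look things up):

Document 1 [雷军买了四斤小米]: 雷军 / 买 / 四斤 / 小米 / 四斤小米
Document 2 [雷军发布了小米手机]: 雷军 / 发布 / 小米 / 小米手机

Stop words: 了, 的, 吗, a, the. These can generally be ignored during tokenization.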
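Stop-word removal can be sketched as a simple filter over tokens that have already been produced by a tokenizer. This is only an illustration: the helper name `RemoveStopWords` and the tiny stop-word set are assumptions made here, not code from the project.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the project): drop stop words from a token list.
std::vector<std::string> RemoveStopWords(const std::vector<std::string> &tokens)
{
    // Illustrative stop-word set, taken from the examples in the text.
    static const std::unordered_set<std::string> stop_words = {"了", "的", "吗", "a", "the"};
    std::vector<std::string> kept;
    for (const auto &token : tokens)
    {
        if (stop_words.count(token) == 0)
        {
            kept.push_back(token); // keep everything that is not a stop word
        }
    }
    return kept;
}
```

Applied to the tokens 雷军 / 买 / 了 / 四斤 / 小米, the filter drops only 了 and keeps the other four in their original order.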

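Inverted index: tokenize the document content, collect the distinct keywords, and map each keyword to the IDs of the documents that contain it.

Simulating one lookup:
User input: 小米 -> look it up in the inverted index -> extract the document IDs (1, 2) -> use the forward index -> fetch the document content ->
build a summary from title + content (desc) + url -> construct the response

Inverted index -> forward index -> document summary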
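The lookup path above (inverted index first, then forward index) can be sketched with a toy in-memory index built from the two example documents. The type `ToyIndex` and the function `BuildExample` are names invented for this sketch; the project's real `Index` class appears later and also stores weights. IDs are 0-based here, so "Document 1" in the text has ID 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy index for illustration only (hypothetical names).
struct ToyIndex
{
    std::vector<std::string> forward;                                // doc_id -> content
    std::unordered_map<std::string, std::vector<uint64_t>> inverted; // keyword -> doc_ids

    // Add one document together with its already-tokenized, distinct keywords.
    uint64_t Add(const std::string &content, const std::vector<std::string> &keywords)
    {
        uint64_t id = forward.size(); // the vector position is the document ID
        forward.push_back(content);
        for (const auto &keyword : keywords)
        {
            inverted[keyword].push_back(id);
        }
        return id;
    }

    // Inverted lookup first, then forward lookup to fetch the contents.
    std::vector<std::string> Search(const std::string &keyword) const
    {
        std::vector<std::string> hits;
        auto it = inverted.find(keyword);
        if (it == inverted.end())
        {
            return hits; // keyword not indexed
        }
        for (uint64_t id : it->second)
        {
            hits.push_back(forward[id]);
        }
        return hits;
    }
};

// Builds the two example documents from the text.
ToyIndex BuildExample()
{
    ToyIndex idx;
    idx.Add("雷军买了四斤小米", {"雷军", "买", "四斤", "小米", "四斤小米"});
    idx.Add("雷军发布了小米手机", {"雷军", "发布", "小米", "小米手机"});
    return idx;
}
```

With this toy index, searching 小米 returns both documents, while 发布 returns only the second one.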

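5. Writing the Parser module: tag stripping and data cleaning

5.1. Stripping tags

First we download the site resources from the Boost website and unpack them into our project, where they are kept as the initial data.

What is a tag?

Tags are HTML markup. They have no value for searching and need to be removed. Tags usually come in pairs!

Why strip tags? Open any of the downloaded pages and it looks like this:

Most of the content is tags, which are useless for searching, so we strip them.

Goal:

Strip the tags from every document and write all documents into one file. No document's content may contain a \n. Records are delimited with \3: in the SaveHtml function shown later, \3 separates the title, content, and url fields of one document, and \n ends each document.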
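The record layout can be sketched as a pair of helper functions. This is a minimal illustration that assumes one record per line in the form title \3 content \3 url \n; the names `Record`, `Serialize`, and `Deserialize` are invented for this sketch and do not appear in the project.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One cleaned document (illustrative type, not the project's DocInfo_t).
struct Record
{
    std::string title;
    std::string content;
    std::string url;
};

const char kFieldSep = '\3';  // separates fields inside one record
const char kRecordEnd = '\n'; // ends each record

// Serialize one record as: title \3 content \3 url \n
std::string Serialize(const Record &r)
{
    std::string line = r.title;
    line += kFieldSep;
    line += r.content;
    line += kFieldSep;
    line += r.url;
    line += kRecordEnd;
    return line;
}

// Parse one line (without its trailing '\n') back into a record.
// Returns false unless exactly three fields are present.
bool Deserialize(const std::string &line, Record *out)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        std::string::size_type pos = line.find(kFieldSep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    if (fields.size() != 3)
    {
        return false;
    }
    out->title = fields[0];
    out->content = fields[1];
    out->url = fields[2];
    return true;
}
```

Because \n ends a record, a document's content must not contain \n itself, which is why the goal above forbids it. \3 is a control character that does not occur in normal page text, so it is a safe field separator.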

Writing the parser file

5.2. Overall code framework:

// Header names are inferred from what the code uses
#include <iostream>
#include <string>
#include <vector>
#include <boost/filesystem.hpp>
#include "util.hpp"

const std::string src_path = "data/input/";         // the raw HTML documents live here
const std::string output = "data/raw_html/raw.txt"; // the clean, tag-free documents are written here

typedef struct DocInfo
{
    std::string title;  // document title
    std::string contnt; // document content
    std::string url;    // URL of this document on the official site
} DocInfo_t;

// const &: input
// *: output
// &: input and output
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main()
{
    std::vector<std::string> files_list;
    // Step 1: recursively collect every HTML file name, with its path, into files_list
    // so that the files can be read one by one later
    if (!EnumFile(src_path, &files_list))
    {
        std::cerr << "enum file name error!" << std::endl;
        return 1;
    }
    // Step 2: read each file listed in files_list and parse its content
    std::vector<DocInfo_t> results;
    if (!ParseHtml(files_list, &results))
    {
        std::cerr << "parse html error" << std::endl;
        return 2;
    }
    // Step 3: write the parsed content of every file to output, using \3 as the separator
    if (!SaveHtml(results, output))
    {
        std::cerr << "save html error" << std::endl;
        return 3;
    }
    return 0;
}

Implementing EnumFile:

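bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    namespace fs = boost::filesystem;
    fs::path root_path(src_path);
    // If the path does not exist there is no point in going any further
    if (!fs::exists(root_path))
    {
        std::cerr << src_path << " not exists" << std::endl;
        return false;
    }
    // A default-constructed iterator marks the end of the recursive traversal
    fs::recursive_directory_iterator end;
    for (fs::recursive_directory_iterator iter(root_path); iter != end; iter++)
    {
        // Skip anything that is not a regular file
        // (the loop header and this check are reconstructed from context)
        if (!fs::is_regular_file(*iter))
        {
            continue;
        }
        // Skip files whose extension is not .html
        if (iter->path().extension() != ".html")
        {
            continue;
        }
        // std::cout << "debug: " << iter->path().string() << std::endl;
        // Save every HTML path in files_list for the later text analysis
        files_list->push_back(iter->path().string());
    }
    return true;
}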

EnumFile test result

As shown below, the path of every .html page can now be printed.

Extracting a page's title and content is fairly simple.

To extract the title, search the page content for the opening and closing title tags and take the substring between them.</p>
<pre><code class="language-cpp">bool ParseTitle(const std::string &file, std::string *title)
{
    std::size_t begin = file.find("&lt;title&gt;");
    if (begin == std::string::npos)
    {
        return false;
    }
    std::size_t end = file.find("&lt;/title&gt;");
    if (end == std::string::npos)
    {
        return false;
    }
    // Move begin past the opening tag so that it points at the title text
    begin += std::string("&lt;title&gt;").size();
    if (begin > end)
    {
        return false;
    }
    *title = file.substr(begin, end - begin);
    return true;
}</code></pre>
<p>Extracting the content is the tag-stripping step itself. Here it is done with a simple state machine.</p>
<pre><code class="language-cpp">bool ParseContent(const std::string &file, std::string *content)
{
    // Strip tags with a simple two-state machine
    enum status
    {
        LABLE,  // currently inside a tag
        CONTENT // currently inside text content
    };
    enum status s = LABLE;
    for (char c : file)
    {
        switch (s)
        {
        case LABLE:
            if (c == '>') s = CONTENT; // the tag has ended
            break;
        case CONTENT:
            if (c == '&lt;') s = LABLE; // a new tag begins
            else
            {
                // '\n' is reserved as the record terminator, so it is not kept
                // (this branch is reconstructed from the stated goal)
                if (c == '\n') c = ' ';
                content->push_back(c);
            }
            break;
        default:
            break;
        }
    }
    return true;
}</code></pre>
<h4 id="%E5%A6%82%E4%BD%95%E6%8F%90%E5%8F%96%E7%BD%91%E9%A1%B5%E7%9A%84url%E5%91%A2%EF%BC%9F">How do we extract the page URL?</h4>
<blockquote> <p><span>The official Boost documentation and the copy we downloaded have matching paths.</span><br /> Example URL on the official site:<br /> https://www.boost.org/doc/libs/1_86_0/doc/html/accumulators.html<br /> Example path in the downloaded copy: boost_1_86_0/doc/html/accumulators.html<br /> Example path after copying into our project: data/input/accumulators.html (we copy the downloaded doc/html/* into data/input/)<br /> url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html";</p> <p>url_tail = [data/input](removed)/accumulators.html -> url_tail = /accumulators.html<br /> <span><span>url = url_head + url_tail, which yields the link on the official site</span></span>!</p> </blockquote>
<pre><code class="language-cpp">bool ParseUrl(const std::string &file_path, std::string *url)
{
    // src_path is "data/input/" (with a trailing slash), so url_tail has no leading
    // slash and url_head must end with one
    std::string url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html/";
    std::string url_tail = file_path.substr(src_path.size());
    *url = url_head + url_tail;
    return true;
}</code></pre>
<p>Writing the parsed results to the file:</p>
<pre><code class="language-cpp">bool SaveHtml(const std::vector&lt;DocInfo_t&gt; &results, const std::string &output)
{
#define SEP '\3'
    // Write in binary mode
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if (!out.is_open())
    {
        std::cerr << "open " << output << " failed!" << std::endl;
        return false;
    }
    // Now write the file contents: title \3 content \3 url \n for each document
    for (auto &item : results)
    {
        std::string out_string;
        out_string = item.title;
        out_string += SEP;
        out_string += item.contnt;
        out_string += SEP;
        out_string += item.url;
        out_string += '\n';
        out.write(out_string.c_str(), out_string.size());
    }
    return true;
}</code></pre>
<h3 id="%E6%B5%8B%E8%AF%95%E8%A7%A3%E6%9E%90%E7%BD%91%E9%A1%B5title%EF%BC%8Ccontent%EF%BC%8Curl%E6%98%AF%E5%90%A6%E6%AD%A3%E7%A1%AE%EF%BC%9F">Testing: are the parsed title, content, and url correct?</h3>
<p><img alt="" height="510" src="https://i-blog.csdnimg.cn/direct/198a0bf61b9345f481b03564f2d6baa5.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p><img alt="" height="618" src="https://i-blog.csdnimg.cn/direct/69bc78f74d9d44e9b7a05077a5ea9ee9.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<blockquote> <p>vim data/input/mpi/history.html</p> </blockquote>
<p>Checking against the downloaded file shows that the result is correct.</p>
<p><img alt="" height="969" src="https://i-blog.csdnimg.cn/direct/baa249c13bfa42d18c748a433fb5bd52.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>Checking against the website shows no problem either.</p>
<p>Finally, the parsed results are written to raw.txt.</p>
<p><img alt="" height="376" src="https://i-blog.csdnimg.cn/direct/31067357beb142b58909d841e14e57ee.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p id="">On a Unix/Linux command line, <span><code>cat raw.txt | wc -l</code></span> counts the lines in <code>raw.txt</code>. A pipe (<code>|</code>) connects two separate commands, <code>cat</code> and <code>wc -l</code>. Since every document occupies one line, the result is also the number of parsed documents.</p>
<ol> <li> <p id=""><strong><code>cat raw.txt</code></strong>: <code>cat</code> is short for "concatenate", but here it simply prints the file. Running <code>cat raw.txt</code> sends the contents of <code>raw.txt</code> to standard output (normally the terminal).</p> </li> <li> <p id=""><strong><code>|</code></strong>: the pipe feeds the output of one command into another as its input. Here the output of <code>cat raw.txt</code> is passed to <code>wc -l</code>.</p> </li> <li> <p id=""><strong><code>wc 
-l</code></strong>: <code>wc</code> is short for "word count", but with the <code>-l</code> option it counts only lines. <code>wc -l</code> therefore reads the data coming from the pipe and counts its lines.</p> </li> </ol>
<p id="">In short, <code>cat raw.txt | wc -l</code> reads <code>raw.txt</code>, counts its lines, and prints the count. For this simple task <code>wc -l raw.txt</code> is shorter and more efficient, because <code>wc</code> can read the file itself without <code>cat</code> printing it first.</p>
<h2 id="6.%E7%BC%96%E5%86%99%E5%BB%BA%E7%AB%8B%E7%B4%A2%E5%BC%95%E7%9A%84%E6%A8%A1%E5%9D%97index">6. Writing the index module</h2>
<h3 id="6.1.index%E6%A8%A1%E5%9D%97%E7%9A%84%E5%9F%BA%E6%9C%AC%E6%A1%86%E6%9E%B6%EF%BC%9A">6.1. Basic framework of the index module:</h3>
<pre><code class="language-cpp">#pragma once
// Header names are inferred from what the code uses
#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
#include &lt;fstream&gt;
#include &lt;unordered_map&gt;
#include &lt;mutex&gt;
#include "util.hpp"

namespace ns_index
{
    struct DocInfo
    {
        std::string title;   // document title
        std::string content; // document content after tag stripping
        std::string url;     // URL of the document on the official site
        uint64_t doc_id;     // document ID; no need to dwell on it for now
    };
    struct InvertedElem
    {
        uint64_t doc_id;
        std::string word;
        int weight; // weight
    };
    // Inverted list (posting list)
    typedef std::vector&lt;InvertedElem&gt; InvertedList;

    class Index
    {
    private:
        // The forward index is an array; the array subscript is naturally the document ID
        std::vector&lt;DocInfo&gt; forward_index; // forward index
        // The inverted index maps one keyword to a group of InvertedElem
        // (keyword -> inverted list)
        std::unordered_map&lt;std::string, InvertedList&gt; inverted_index; // inverted index

    private:
        Index() {} // must have a body; it cannot be deleted
        Index(const Index &) = delete;
        Index &operator=(const Index &) = delete;
        static Index *instance;
        static std::mutex mtx;

    public:
        ~Index() {}

    public:
        // Singleton access with double-checked locking
        static Index *GetInstance()
        {
            if (nullptr == instance)
            {
                mtx.lock();
                if (nullptr == instance)
                {
                    instance = new Index();
                }
                mtx.unlock();
            }
            return instance;
        }

    public:
        // Find the document content by doc_id
        DocInfo *GetForwardIndex(uint64_t doc_id)
        {
            if (doc_id >= forward_index.size())
            {
                std::cerr << "doc_id out range, error!" << std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }
        // Get the inverted list for a keyword
        InvertedList *GetInvertedList(const std::string &word)
        {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end())
            {
                std::cerr << word << " have no InvertedList" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }
        // Build the forward and inverted indexes from the tag-stripped, formatted documents
        bool BuildIndex(const std::string &input) // input: the data produced by the parser
        {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open())
            {
                std::cerr << "sorry, " << input << " open error" << std::endl;
                return false;
            }
            std::string line;
            int count = 0;
            while (std::getline(in, line))
            {
                DocInfo *doc = BuildForwardIndex(line);
                if (nullptr == doc)
                {
                    std::cerr << "build " << line << " error" << std::endl; // for debug
                    continue;
                }
                BuildInvertedIndex(*doc);
                count++;
                if (count % 50 == 0)
                {
                    std::cout << "当前已经建立的索引文档: " << count << std::endl;
                    // LOG(NORMAL, "当前的已经建立的索引文档: " + std::to_string(count));
                }
            }
            return true;
        }</code></pre>
<h3 id="6.2.%E5%BB%BA%E7%AB%8B%E6%AD%A3%E6%B4%BE%E7%B4%A2%E5%BC%95%EF%BC%9A">6.2. Building the forward index:</h3>
<pre><code class="language-cpp">DocInfo *BuildForwardIndex(const std::string &line)
{
    // 1. Parse the line by splitting it
    // line -> 3 strings: title, content, url
    std::vector&lt;std::string&gt; results;
    const std::string sep = "\3"; // field separator inside a line
    ns_util::StringUtil::Split(line, &results, sep);
    // ns_util::StringUtil::CutString(line, &results, sep);
    if (results.size() != 3)
    {
        return nullptr;
    }
    // 2. Fill a DocInfo from the strings
    DocInfo doc;
    doc.title = results[0];   // title
    doc.content = results[1]; // content
    doc.url = results[2];     // url
    doc.doc_id = forward_index.size(); // set the id before inserting: it equals the doc's subscript in the vector
    // 3.
    // Insert into the forward-index vector
    forward_index.push_back(std::move(doc)); // doc: the content of one HTML file
    return &forward_index.back();
}</code></pre>
<p>When splitting the line for the forward index, I used <span>the split function from the Boost library</span>.</p>
<pre><code class="language-cpp">    class StringUtil
    {
    public:
        static void Split(const std::string &target, std::vector&lt;std::string&gt; *out, const std::string &sep)
        {
            // boost split
            boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
        }
    };</code></pre>
<h3 id="split()%E5%87%BD%E6%95%B0%E5%85%B7%E4%BD%93%E4%BD%BF%E7%94%A8%E8%AF%B4%E6%98%8E%EF%BC%9A">How to use split():</h3>
<p>The split function in the Boost library cuts a string into pieces.</p>
<p>Header to include: <code>&lt;boost/algorithm/string.hpp&gt;</code></p>
<p>boost::split() cuts a std::string and stores the resulting pieces in a std::vector.</p>
<p>It takes 4 arguments:</p>
<p>Take boost::split(type, select_list, boost::is_any_of(","), boost::token_compress_on); as an example.</p>
<p>(1) type: a std::vector of strings that receives the pieces after splitting.</p>
<p>(2) select_list: the input string; it may be empty.</p>
<p>(3) boost::is_any_of(","): sets the separator to , (a comma).</p>
<p>(4) boost::token_compress_on: <span>treats several consecutive separators <strong>as a single one</strong></span>! It is off by default (boost::token_compress_off), and it is usually worth turning on.</p>
<p>Test code:</p>
<p><img alt="" height="372" src="https://i-blog.csdnimg.cn/direct/469a3d6335424a12b786766cd15c64b9.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>The output is exactly three parts, with no empty tokens in between!</p>
<h3 id="6.3.%E5%BB%BA%E7%AB%8B%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95%EF%BC%9A">6.3. Building the inverted index:</h3>
<p>Both the title and the content must be tokenized first, using jieba. Searches are case-insensitive, so every word is converted to lowercase.</p>
<p>There is one pitfall when using jieba: the limonp headers must be copied into the include directory by hand, otherwise compilation fails!</p>
<p><img alt="" height="426" src="https://i-blog.csdnimg.cn/direct/c2ae91be8c8042f293b10e30d2036ad2.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p><img alt="" height="365" src="https://i-blog.csdnimg.cn/direct/25501077caf4471691d7cff6d8bb5b68.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<pre><code class="language-cpp">        bool BuildInvertedIndex(const DocInfo &doc)
        {
            // DocInfo{title, content, url, doc_id}
            // word -> inverted list
            struct word_cnt
            {
                int title_cnt;
                int content_cnt;
                word_cnt() : title_cnt(0), 
content_cnt(0) {}
            };
            std::unordered_map&lt;std::string, word_cnt&gt; word_map; // temporary map holding the word frequencies
            // Tokenize the title
            std::vector&lt;std::string&gt; title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            // if(doc.doc_id == 1572){
            //     for(auto &s : title_words){
            //         std::cout << "title: " << s << std::endl;
            //     }
            // }
            // Count word frequencies in the title
            for (std::string s : title_words)
            {
                boost::to_lower(s);       // convert everything to lowercase
                word_map[s].title_cnt++;  // operator[] fetches the entry, creating it if absent
            }
            // Tokenize the document content
            std::vector&lt;std::string&gt; content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            // if(doc.doc_id == 1572){
            //     for(auto &s : content_words){
            //         std::cout << "content: " << s << std::endl;
            //     }
            // }
            // Count word frequencies in the content
            for (std::string s : content_words)
            {
                boost::to_lower(s);
                word_map[s].content_cnt++;
            }
// A hit in the title counts X times as much as a hit in the content
#define X 10
#define Y 1
            // Hello, hello, HELLO all map to the same lowercase key
            for (auto &word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt; // relevance
                InvertedList &inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }
            return true;
        }
    };
    // Definitions of the static members declared in Index (needed for linking)
    Index *Index::instance = nullptr;
    std::mutex Index::mtx;
} // namespace ns_index</code></pre>
<h2 id="7.%E7%BC%96%E5%86%99%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E%E6%A8%A1%E5%9D%97searcher">7. Writing the searcher module</h2>
<h3 id="7.1.%E5%9F%BA%E6%9C%AC%E4%BB%A3%E7%A0%81%E6%A1%86%E6%9E%B6%EF%BC%9A">7.1. Basic code framework:</h3>
<pre><code class="language-cpp">#include "index.hpp"
namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index; // the index used for lookups
    public:
        Searcher() {}
        ~Searcher() {}
    public:
        void InitSearcher(const std::string &input)
        {
            // 1. Get or create the index object
            // 2.
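            // Build the index through the index object
        }
        // query: the search keywords
        // json_string: the search result returned to the user's browser
        void Search(const std::string &query, std::string *json_string)
        {
            // 1. [Tokenize]: tokenize the query the way the searcher requires
            // 2. [Trigger]: look up each token in the index
            // 3. [Merge and sort]: collect the results and sort by relevance (weight), descending
            // 4. [Build]: build the JSON string from the results -- jsoncpp
        }
    };
}</code></pre>
<h3 id="7.2.%E5%BB%BA%E7%AB%8B%E6%91%98%E8%A6%81">7.2. Building the summary</h3>
<p>Why build a summary?<br /> A search engine never shows the user the entire content of a page. It returns a summary that captures the gist. How do we implement that?<br /> Find the first occurrence of word in html_content, then go back 50 bytes (or to the beginning if there are fewer) and forward 100 bytes (or to the end if there are fewer).</p>
<p><img alt="" height="747" src="https://i-blog.csdnimg.cn/direct/418108a65dd94dcf9834e9e37d75397b.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>When defining the start and end positions, watch out for signed/unsigned comparisons between size_t and int. They are an easy source of bugs!</p>
<ol> <li>size_t is an unsigned type, so careless use (for example, arithmetic that goes negative) gives surprising results. Assigning a negative number to a size_t turns it into a very large positive number.</li> </ol>
<p>Code:</p>
<pre><code class="language-cpp">std::string GetDesc(const std::string &html_content, const std::string &word)
{
    // Find the first occurrence of word in html_content, then go back 50 bytes
    // (or to the beginning) and forward 100 bytes (or to the end),
    // and cut out that part
    const int prev_step = 50;
    const int next_step = 100;
    // 1. Find the first occurrence
    // std::string::find is case-sensitive and the index stores lowercased words,
    // so a plain find could miss the match; compare case-insensitively instead
    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                            [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });
    if (iter == html_content.end())
    {
        return "None1";
    }
    int pos = std::distance(html_content.begin(), iter);
    // 2.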
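    // Compute start and end; std::size_t is unsigned, so int is used here
    int start = 0;
    int end = html_content.size() - 1;
    // If there are more than 50 characters before pos, move the start forward
    if (pos > start + prev_step) start = pos - prev_step;
    // If there are more than 100 characters after pos, pull the end back
    if (pos < end - next_step) end = pos + next_step;
    // 3. Cut out the substring
    if (start >= end) return "None2";
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}</code></pre>
<h4 id="%E9%97%AE%E9%A2%98%EF%BC%9A%E6%90%9C%E7%B4%A2%E7%BB%93%E6%9E%9C%E5%87%BA%E7%8E%B0%E9%87%8D%E5%A4%8D%E6%96%87%E6%A1%A3%E7%9A%84%E9%97%AE%E9%A2%98">Problem: duplicate documents in the search results</h4>
<p>For example, when we search for "你是一个好人" (you are a good person), jieba splits the query into 你 / 一个 / 好人 / 一个好人. In the inverted index these tokens may all point to the same document, so the same document shows up several times in the results.</p>
<p>Symptom:<br /> <img alt="" height="535" src="https://i-blog.csdnimg.cn/direct/33e0f6222bcb4909ae36b4c2eecf6190.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>We changed the content of one Boost document to "你是一个好人". Searching for 你是一个好人 then returns duplicate results:</p>
<p><img alt="" height="1059" src="https://i-blog.csdnimg.cn/direct/fcad34f49d3b493fac2c98c365beb103.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>So we need to deduplicate. How do we decide that two hits are the same? Compare the document IDs. The weight has to change as well: the weights of all hits on the same document are added up, and the sum is that document's real weight!</p>
<p>Result after deduplication:</p>
<p><img alt="" height="967" src="https://i-blog.csdnimg.cn/direct/9d0279f2c7d24eafa4d93bedef31c146.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<h3 id="7.3.%E4%BF%AE%E6%94%B9%E5%90%8E%E5%8E%BB%E9%87%8D%E7%9A%84%E4%BB%A3%E7%A0%81%EF%BC%9A">7.3. Revised code with deduplication:</h3>
<pre><code class="language-cpp">#pragma once
#include "index.hpp"
#include "util.hpp"
#include "log.hpp"
// Header names are inferred from what the code uses; the jsoncpp path
// assumes the layout installed by the Ubuntu jsoncpp development package
#include &lt;algorithm&gt;
#include &lt;unordered_map&gt;
#include &lt;jsoncpp/json/json.h&gt;

namespace ns_searcher
{
    // One entry per document after deduplication
    struct InvertedElemPrint
    {
        uint64_t doc_id;
        int weight;                       // sum of the weights of every matched word
        std::vector&lt;std::string&gt; words;   // every query word that hit this document
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };

    class Searcher
    {
    private:
        ns_index::Index *index; // the index used for lookups
    public:
        Searcher() {}
        ~Searcher() {}
    public:
        void InitSearcher(const std::string &input)
        {
            // 1.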
            // Get or create the index object
            index = ns_index::Index::GetInstance();
            std::cout << "获取index单例成功..." << std::endl;
            // 2. Build the index through the index object
            index->BuildIndex(input);
            std::cout << "建立正排和倒排索引成功..." << std::endl;
            // LOG(NORMAL, "建立正排和倒排索引成功...");
        }
        // query: the search keywords
        // json_string: the search result returned to the user's browser
        void Search(const std::string &query, std::string *json_string)
        {
            // 1. [Tokenize]: tokenize the query the way the searcher requires
            std::vector&lt;std::string&gt; words;
            ns_util::JiebaUtil::CutString(query, &words);
            // 2. [Trigger]: look up each token in the index. The index was built
            // case-insensitively, so the query words are lowercased as well
            // ns_index::InvertedList inverted_list_all; // holds InvertedElem
            std::vector&lt;InvertedElemPrint&gt; inverted_list_all;
            std::unordered_map&lt;uint64_t, InvertedElemPrint&gt; tokens_map; // doc_id -> merged entry
            for (std::string word : words)
            {
                boost::to_lower(word);
                ns_index::InvertedList *inverted_list = index->GetInvertedList(word);
                if (nullptr == inverted_list)
                {
                    continue;
                }
                // The imperfect earlier version: 你/是/一个/好人 100
                // inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
                for (const auto &elem : *inverted_list)
                {
                    auto &item = tokens_map[elem.doc_id]; // []: fetch if present, create if absent
                    // item is always the print node that shares this doc_id
                    item.doc_id = elem.doc_id;
                    item.weight += elem.weight;
                    item.words.push_back(elem.word);
                }
            }
            // Non-const reference so that std::move really moves the merged entries
            for (auto &item : tokens_map)
            {
                inverted_list_all.push_back(std::move(item.second));
            }
            // 3. [Merge and sort]: collect the results and sort by relevance (weight), descending
            // std::sort(inverted_list_all.begin(), inverted_list_all.end(),
            //     [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2){
            //         return e1.weight > e2.weight;
            //     });
            std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                      [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) {
                          return e1.weight > e2.weight;
                      });
            // 4. [Build]: build the JSON string from the results -- jsoncpp handles serialization and deserialization
            Json::Value root;
            for (auto &item : inverted_list_all)
            {
                ns_index::DocInfo *doc = index->GetForwardIndex(item.doc_id);
                if (nullptr == doc)
                {
                    continue;
                }
                Json::Value elem;
                elem["title"] = doc->title;
                // content is the whole tag-stripped document, which is too much; we only want a part of it
                elem["desc"] = GetDesc(doc->content, item.words[0]);
                elem["url"] = doc->url;
                // for debug, to be deleted
                elem["id"] = (int)item.doc_id;
                elem["weight"] = item.weight; // int->string
                root.append(elem);
            }
            Json::StyledWriter writer;
            // Json::FastWriter writer;
            *json_string = writer.write(root);
        }
        std::string GetDesc(const std::string &html_content, const std::string &word)
        {
            // Find the first occurrence of word in html_content, then go back 50 bytes
            // (or to the beginning) and forward 100 bytes (or to the end),
            // and cut out that part
            const int prev_step = 50;
            const int next_step = 100;
            // 1. Find the first occurrence
            // std::string::find is case-sensitive and the index stores lowercased words,
            // so a plain find could miss the match; compare case-insensitively instead
            auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                                    [](int x, int y) { return (std::tolower(x) == std::tolower(y)); });
            if (iter == html_content.end())
            {
                return "None1";
            }
            int pos = std::distance(html_content.begin(), iter);
            // 2. Compute start and end; std::size_t is unsigned, so int is used here
            int start = 0;
            int end = html_content.size() - 1;
            // If there are more than 50 characters before pos, move the start forward
            if (pos > start + prev_step) start = pos - prev_step;
            // If there are more than 100 characters after pos, pull the end back
            if (pos < end - next_step) end = pos + next_step;
            // 3. Cut out the substring
            if (start >= end) return "None2";
            std::string desc = html_content.substr(start, end - start);
            desc += "...";
            return desc;
        }
    };
}</code></pre>
<h3 id="7.4.%E6%B5%8B%E8%AF%95%EF%BC%9A">7.4. Testing:</h3>
<p>Are the results really sorted by weight? We can print weight to check.</p>
<p><img alt="" height="1200" src="https://i-blog.csdnimg.cn/direct/aca7c6e5a6b14fd5a64e554b07aba436.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>The largest weight is 16 and the smallest is 1. Let us open the pages and verify by hand.</p>
<p><img alt="" height="1200" src="https://i-blog.csdnimg.cn/direct/0f8d3248e8974879b7f182f063847c0c.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>This is the page with weight 16: the keyword appears 16 times in its content. Below is the page with weight 1.</p>
<p><img alt="" height="1200" src="https://i-blog.csdnimg.cn/direct/80cb3fb8ae6d4191894551cc8ee279ad.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="1200" /></p>
<p>It appears exactly once. Correct!</p>
<h2 id="8.%E7%BC%96%E5%86%99%20http_server%20%E6%A8%A1%E5%9D%97">8. Writing the http_server module</h2>
<p>There is no need to reinvent the wheel here. The open-source cpp-httplib library is enough to set up the network communication.</p>
<p>Basic test code for the HTTP server:</p>
<pre><code class="language-cpp">#include "httplib.h"
int main()
{
    httplib::Server svr;
    svr.Get("/hi", [](const httplib::Request &req, 
httplib::Response &rsp) {
        rsp.set_content("你好,世界!", "text/plain; charset=utf-8");
    });
    svr.listen("0.0.0.0", 8085);
    return 0;
}</code></pre>
<p><img alt="" height="280" src="https://i-blog.csdnimg.cn/direct/6cc65e77e0d74bc68db8e390ce999a2d.png" alt="【Boost搜索引擎项目】构建Boost站内搜索引擎的技术实践与探索_boost構建" width="926" /></p>
<p>No problem!</p>
<p>So all we need is to know how to use the basic interface.</p>
<h3>Code walkthrough:</h3>
<ol> <li> <p id=""><code>#include"httplib.h"</code>: includes the header of the <code>httplib</code> library so that the program can use what <code>httplib</code> provides.</p> </li> <li> <p id=""><code>int main() {</code>: defines the main function, where execution of a C++ program starts.</p> </li> <li> <p id=""><code>httplib::Server svr;</code>: creates an instance <code>svr</code> of the <code>httplib::Server</code> class. This instance represents an HTTP server.</p> </li> <li> <p id=""><code>svr.Get("/hi", [](const httplib::Request &req, httplib::Response &rsp){ ... });</code>: registers a route that handles GET requests. When the URL <code>/hi</code> is requested, the lambda that follows is executed. The lambda takes two parameters: a <code>const httplib::Request &req</code> (the request that was received) and an <code>httplib::Response &rsp</code> (the response to send).</p> </li> <li> <p id=""><code>rsp.set_content("你好,世界!", "text/plain; charset=utf-8");</code>: inside the lambda, sets the response body to "你好,世界!" ("Hello, world!") and declares the content type as plain text (<code>text/plain</code>) with the UTF-8 character set. A request to <code>/hi</code> therefore returns this text.</p> </li> <li> <p id=""><code>svr.listen("0.0.0.0",8085);</code>: makes the server listen on port 8085 on all network interfaces (<code>0.0.0.0</code>). The server accepts connections from any IP address as long as they target port 8085.</p> </li> <li> <p id=""><code>return 0;</code>: the main function returns 0, which normally means the program ended successfully.</p> </li> <li> <p id=""><code>}</code>: the closing brace of the main function.</p> </li> </ol>
<pre><code class="language-cpp">#include "httplib.h"
#include "searcher.hpp"

const std::string input = "data/raw_html/raw.txt";
const std::string root_path = "./wwwroot";

int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());
    svr.Get("/s", [&search](const httplib::Request &req, httplib::Response &rsp) {
        // The search keyword arrives as the "word" query parameter
        if (!req.has_param("word"))
        {
            rsp.set_content("必须要有搜索关键字!", "text/plain; charset=utf-8");
            return;
        }
        std::string word = req.get_param_value("word");
        // std::cout << "用户在搜索：" << word << std::endl;
        LOG(NORMAL, "用户搜索的: " + word);
        std::string json_string;
        search.Search(word, &json_string);
        rsp.set_content(json_string, "application/json");
        // rsp.set_content("你好,世界!", "text/plain; charset=utf-8");
    });
    LOG(NORMAL, "服务器启动成功...");
    svr.listen("0.0.0.0", 8085);
    return 0;
}</code></pre>
<h2 id="9.%E7%AE%80%E5%8D%95%E7%9A%84%E6%97%A5%E5%BF%97%E7%B3%BB%E7%BB%9F">9. A simple logging system</h2>
<pre><code class="language-cpp">#pragma once
// Header names are inferred from what the code uses
#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;ctime&gt;

#define NORMAL 1
#define WARNING 2
#define DEBUG 3
#define FATAL 4

// #LEVEL turns the level name into a string; __FILE__ and __LINE__ record the call site
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

// inline, because this header may be included from more than one source file
inline void log(std::string level, std::string message, std::string file, int line)
{
    std::cout << "[" << level << "]" << "[" << time(nullptr) << "]" << "[" << message << "]" << "[" << file << " : " << line << "]" << std::endl;
}</code></pre>
<h2 id="10.%E5%89%8D%E7%AB%AF%E4%BB%A3%E7%A0%81">10. Front-end code</h2>
<p>Our focus is the back end, so the front-end code is not explained.</p>
<p>Source code:</p>
<pre><code class="language-html"> <title>boost 搜索引擎
/* Remove every default margin and padding on the page (the HTML box model) */
* {
    /* outer margin */
    margin: 0;
    /* inner padding */
    padding: 0;
}
/* Make the content of body match html at 100% height */
html, body {
    height: 100%;
}
/* Class selector .container */
.container {
    /* width of the div */
    width: 800px;
    /* center it horizontally through the margins */
    margin: 0px auto;
    /* top margin, to keep a distance from the top of the page */
    margin-top: 15px;
}
/* Compound selector: search inside container */
.container .search {
    /* same width as the parent tag */
    width: 100%;
    /* height set to 52px */
    height: 52px;
}
/* Select the input tag first, then set its properties; input is a tag selector */
/* The height set on input does not take the border into account */
.container .search input {
    /* float left */
    float: left;
    width: 600px;
    height: 50px;
    /* border properties: width, style, color */
    border: 1px solid black;
    /* remove the right border of the input box */
    border-right: none;
    /* left padding, so the text does not touch the left border */
    padding-left: 10px;
    /* color and size of the text inside the input */
    color: #CCC;
    font-size: 14px;
}
/* Select the button tag first, then set its properties; button is a tag selector */
.container .search button {
    /* float left */
    float: left;
    width: 150px;
    height: 52px;
    /* 
background color of the button, #4e6ef2 */
    background-color: #4e6ef2;
    /* text color inside the button */
    color: #FFF;
    /* font size */
    font-size: 19px;
    font-family: Georgia, 'Times New Roman', Times, serif;
}
.container .result {
    width: 100%;
}
.container .result .item {
    margin-top: 15px;
}
.container .result .item a {
    /* block-level element, on a line of its own */
    display: block;
    /* remove the underline of the a tag */
    text-decoration: none;
    /* font size of the text in the a tag */
    font-size: 20px;
    /* font color */
    color: #4e6ef2;
}
.container .result .item a:hover {
    text-decoration: underline;
}
.container .result .item p {
    margin-top: 5px;
    font-size: 16px;
    font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
}
.container .result .item i {
    /* block-level element, on a line of its own */
    display: block;
    /* cancel the italic style */
    font-style: normal;
    color: green;
}

<!--

这是标题

这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要