A Step-by-Step Guide to Building a Document Search Engine


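🏠 Hi everyone, I'm Yui_ 💬
🍑 If you spot a mistake in this article, please point it out so we can all learn and improve together 👀
🚀 If anything is unclear, feel free to ask and I'll do my best to explain.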

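Contents

  • Project overview and demo
  • 1. Project breakdown
    • 1.1 Tech stack
  • 2. Getting the web pages
  • 3. Content parsing and preprocessing
    • 3.1 Collecting `html` paths
    • 3.2 Parsing the `html` files
      • 3.2.1 Extracting the `title`
      • 3.2.2 Extracting the `content`
      • 3.2.3 Building the url
    • 3.3 Saving the `html` file contents
  • 4. Building the index
    • 4.1 Background: forward index
    • 4.2 Background: inverted index
    • 4.3 Creating the `index` class
      • 4.3.1 Implementing `BuildIndex`
      • 4.3.2 Implementing `BuildForwardIndex`
      • 4.3.3 Implementing `BuildInvertedIndex`
      • 4.3.4 Background: the cppjieba library
      • 4.3.5 Extending the utility class: word segmentation
      • 4.3.6 Finishing the index class
  • 5. Retrieval and ranking
    • 5.1 Initialization
    • 5.2 Implementing `Search`
      • 5.2.1 Implementing the excerpt function `GetDesc`
  • 6. Simple debugging
  • 7. Putting the search engine online
    • 7.1 Introducing `cpp-httplib`
    • 7.2 Writing `http_server.cc`
  • 8. Front end
    • 8.1 `html`
    • 8.2 `CSS`
    • 8.3 JS
  • 9. Project summary
    • 9.1 A small rant
  • 10. Project source code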

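Project overview and demo

The core job of a search engine is to help users find what they need, quickly and accurately, in a huge volume of information.
To do that, a search engine relies on the following techniques:

  • Web crawling
  • Content parsing and preprocessing
  • Index building
  • Retrieval and ranking
  • Natural language processing

Since this article only searches the Boost documentation, it does not need web crawling. Real search engines involve far more than the list above; they are very complex systems, and this article is only a simple implementation.
Let's look at the result first.
[Screenshot: the finished search page]

It works quite well. The Boost site now has its own search feature, but what we build here is really a general document search engine: if you need to, you can point it at other documentation instead of Boost's.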

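1. Project breakdown

To build the Boost search engine we need each of the techniques listed above.
Web crawling:
Instead of crawling, we can simply download the official documentation, which gives us the content of the official pages.
Content parsing and preprocessing:
We parse every html file in the official documentation, since these are what the site displays. Once all the html files are found, we extract the title, body text, and url of each one.
Index building:
The index organizes and stores the parsed information in a structure that supports fast retrieval.
Retrieval and ranking:
When the user enters a query, the engine must quickly find the relevant content in the index and rank the results by some algorithm, so that the most relevant ones are shown first.
Natural language processing:
Understanding the user's query and the meaning of the page content requires word segmentation.

1.1 Tech stack

Tech stack: C/C++, C++11, STL, the quasi-standard Boost library, Jsoncpp, cppjieba, cpp-httplib, html5, css, js, jQuery, Ajax
Environment: centos/ubuntu, vim, gcc (g++), Makefile, vscode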

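2. Getting the web pages

Go to the official Boost website.
The current version at the time of writing is 1.87.
[Screenshot: the Boost download page]

Download the archive, move it to your Linux environment, and unpack it.
The command is:

tar -xzf boost_1_87_0.tar.gz

After unpacking, create a directory boost_searcher; this is our working directory. Inside it, create data/input, which stores the data, and copy everything under boost_1_87_0/doc/html/* into data/input.
After copying, data/input contains the documentation files:
[Screenshot: listing of data/input]

Then go back to the working directory boost_searcher and create parser.cpp, which will do the content parsing.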

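3. Content parsing and preprocessing

Before parsing, we need to be clear about what to parse. This is a search engine and every result is a web page, so the input is always html files. The first step is therefore to find and record all the html files, and then parse them. Before that, a quick look at the format of an html file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
</body>
</html>

An html file is made up of tags. What we want is the content inside them, so parsing means stripping all the tags.
The last part is preprocessing: we write the parsed content of every file into a single output file.
In short, there are three steps (files_list and output are variable names used in the code below):

  1. Recursively collect the path of every html file into files_list, so the files can be read one by one later.
  2. Read each file listed in files_list and parse it.
  3. Write the parsed content of every file to output, one document per line, with \3 separating the title, content, and url fields.

Below is the skeleton of parser.cpp. Completing the functions it declares finishes this part of the code; I will go through them one at a time and explain the idea behind each.

#include <cerrno>   // errno
#include <cstdlib>  // exit
#include <cstring>  // strerror
#include <iostream>
#include <string>
#include <vector>
#include "Log.hpp"

// Directory that holds all the html pages
const std::string src_path = "data/input";
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo
{
    std::string title;   // title of the document
    std::string content; // body text of the document
    std::string url;     // url of the document on the official site
} DocInfo_t;

/**
 * const & : input parameter
 * *       : output parameter
 * &       : input/output parameter
 */
bool EnumFile(const std::string& src_path, std::vector<std::string>* files_list);
bool ParseHtml(const std::vector<std::string>& files_list, std::vector<DocInfo_t>* results);
bool SaveHtml(const std::vector<DocInfo_t>& results, const std::string& output);

int main()
{
    std::vector<std::string> files_list;
    // 1. Recursively save the path of every html file into files_list for later reading
    if (!EnumFile(src_path, &files_list))
    {
        LOG(FATAL, "enum file name error [%d->%s]", errno, strerror(errno));
        exit(1);
    }
    // 2. Read each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if (!ParseHtml(files_list, &results))
    {
        LOG(FATAL, "parse html error [%d->%s]", errno, strerror(errno));
        exit(2);
    }
    // 3. Write the parsed documents to output: one document per line, fields separated by \3
    if (!SaveHtml(results, output))
    {
        LOG(FATAL, "save html error [%d->%s]", errno, strerror(errno));
        exit(3);
    }
    return 0;
}

You can ignore the log header: it is a logging system I wrote earlier and simply reuse here, and it has little effect on the rest of the code.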

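3.1 Collecting html paths

Idea: record the path of every file whose extension is html. This requires filesystem functionality.
I find the C++ standard filesystem library less convenient than Boost's, so we use Boost here.
If Boost is not installed and, like me, you are on CentOS, run:

sudo yum install boost boost-devel

Once it is installed, include the header:

#include <boost/filesystem.hpp>

Steps: check whether the target path exists; if it does, walk it recursively and record every file whose extension is html.

bool EnumFile(const std::string& src_path, std::vector<std::string>* files_list)
{
    namespace fs = boost::filesystem;
    fs::path root_path(src_path);
    // Return immediately if the path does not exist
    if (!fs::exists(root_path))
    {
        LOG(FATAL, "%s not exists", src_path.c_str());
        return false;
    }
    // A default-constructed iterator marks the end of the recursive traversal
    fs::recursive_directory_iterator end;
    for (fs::recursive_directory_iterator iter(root_path); iter != end; iter++)
    {
        // Skip anything that is not a regular file (html files are regular files)
        if (!fs::is_regular_file(*iter))
        {
            continue;
        }
        // Skip files whose extension is not .html
        if (iter->path().extension() != ".html")
        {
            continue;
        }
        // Everything that reaches this point is an html file: record it
        LOG(DEBUG, "%s", iter->path().string().c_str());
        files_list->push_back(iter->path().string());
    }
    return true;
}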

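Run it with the debug log enabled to check that the paths were collected:
[Screenshot: debug output listing the collected html paths]
Only part of the output is shown here because there is far too much of it. After checking, the debug line can be commented out.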

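3.2 Parsing the html files

Open any one of the html files:
[Screenshot: source of an html file with the title underlined in green]
What we need is everything except the tags. The part underlined in green, for example, is the page title, which we certainly need; we also need the page content and the url. These are the pieces we parse out.
Parsing an html file can be split into 4 steps:

  1. Read the file
  2. Parse the file and extract the title
  3. Parse the file and extract the content
  4. Parse the file and build the url

Before solving these, we create a utility header (util.hpp, in namespace ns_util) that will handle file reading and, later, word segmentation.
We write the file-reading part first and add word segmentation later:

#pragma once
#include <cerrno>   // errno
#include <cstring>  // strerror
#include <fstream>
#include <iostream>
#include <string>
#include "Log.hpp"

namespace ns_util
{
    class FileUtil
    {
    public:
        // Reads the whole file at file_path into *out
        static bool ReadFile(const std::string& file_path, std::string* out)
        {
            std::ifstream in(file_path, std::ios::in);
            if (!in.is_open())
            {
                LOG(FATAL, "open file [%d->%s]", errno, strerror(errno));
                return false;
            }
            std::string line;
            // std::getline drops the trailing '\n', so the lines are appended back to back
            while (std::getline(in, line))
            {
                *out += line;
            }
            in.close();
            return true;
        }
    };
}

Next, the structure of the function that parses the html files:

bool ParseHtml(const std::vector<std::string>& files_list, std::vector<DocInfo_t>* results)
{
    for (const std::string& file : files_list)
    {
        // 1. Read the file
        std::string result;
        if (!ns_util::FileUtil::ReadFile(file, &result))
        {
            continue;
        }
        DocInfo_t doc;
        // 2. Parse the file and extract the title
        if (!ParseTitle(result, &doc.title))
        {
            continue;
        }
        // 3. Parse the file and extract the content
        if (!ParseContent(result, &doc.content))
        {
            continue;
        }
        // 4. Build the url
        if (!ParseUrl(file, &doc.url))
        {
            continue;
        }
        // Parsing succeeded if we get here; doc now holds everything about this file
        results->push_back(std::move(doc)); // move to avoid a copy
    }
    return true;
}

With the structure clear, we tackle the helpers one at a time.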

3.2.1 Extracting the title

Extracting the title is very simple, because a page has only one title tag and its format is fixed: <title>xxx</title>. To get xxx, find the positions of <title> and </title> and take the substring between them.

static bool ParseTitle(const std::string& file, std::string* title)
{
    std::size_t begin = file.find("<title>");
    if (begin == std::string::npos)
    {
        return false;
    }
    std::size_t end = file.find("</title>");
    // Without this check a missing closing tag would copy the rest of the file
    if (end == std::string::npos)
    {
        return false;
    }
    // Move begin past the opening tag so it points at the first character of the title
    begin += std::string("<title>").size();
    if (begin > end)
    {
        return false;
    }
    *title = file.substr(begin, end - begin);
    return true;
}
<h4>3.2.2 Extracting the <code>content</code></h4>
<p>Extracting the <code>content</code> is not as simple as extracting the <code>title</code>, because the content is spread across many different tags. So we focus on the general shape <code>&lt;tag&gt;xxx&lt;/tag&gt;</code>. Notice that whatever sits between a <code>&gt;</code> and the next <code>&lt;</code> is <code>content</code>, so a simple state machine can do the job.<br /> The state machine guarantees that we only keep what lies between <code>&gt;</code> and <code>&lt;</code>.<br /> It has two states: <code>LABLE</code> means we are inside a tag, and <code>CONTENT</code> means we are inside content.<br /> Reading a <code>&gt;</code> switches from the tag state to the content state, and reading a <code>&lt;</code> switches back to the tag state.</p>
<pre><code class="prism language-cpp">static bool ParseContent(const std::string&amp; file, std::string* content)
{
    // Strip the tags with a simple state machine
    enum status
    {
        LABLE,
        CONTENT
    };
    enum status s = LABLE; // a file always starts inside a tag
    for (char c : file)
    {
        switch (s)
        {
        case LABLE:
            if (c == '&gt;') s = CONTENT;
            break;
        case CONTENT:
            if (c == '&lt;') s = LABLE;
            else
            {
                // Do not keep '\n' from the source file: '\n' is reserved as the separator between documents
                if (c == '\n') c = ' ';
                content-&gt;push_back(c);
            }
            break;
        default:
            break;
        }
    }
    return true;
}</code></pre>
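<p>To see the state machine work on a small input, here is a minimal stand-alone sketch. The helper name <code>StripTags</code> and the sample html are assumptions made for illustration; the logic mirrors <code>ParseContent</code> but returns the text instead of writing through a pointer.</p>

```cpp
#include <string>

// Hypothetical helper (not part of the article's parser.cpp): returns the text
// found between '>' and '<', using the same two states as ParseContent.
std::string StripTags(const std::string& html)
{
    enum class State { kTag, kText };
    State s = State::kTag; // same initial state as ParseContent
    std::string text;
    for (char c : html)
    {
        switch (s)
        {
        case State::kTag:
            if (c == '>') s = State::kText; // tag closed, text may follow
            break;
        case State::kText:
            if (c == '<') s = State::kTag;  // a new tag starts
            else text.push_back(c == '\n' ? ' ' : c); // newlines become spaces
            break;
        }
    }
    return text;
}
```

<p>For example, <code>StripTags("&lt;h1&gt;Boost&lt;/h1&gt;\n&lt;p&gt;move&lt;/p&gt;")</code> yields <code>Boost move</code>. Because the machine starts in the tag state, any text before the first <code>&gt;</code> is dropped, and it does not treat script or style blocks specially.</p>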
<h4>3.2.3 Building the url</h4>
<p>Why do we "build" the url? Open the official site again.<br /> Look at this page: <code>https://www.boost.org/doc/libs/1_87_0/doc/html/move.html</code><br /> Split the link in two: the leading <code>https://www.boost.org/doc/libs/1_87_0/doc/html</code> is fixed, and the trailing <code>/move.html</code> is the part of our saved html path that follows <code>data/input</code>. Knowing this, we can build the <code>url</code>.</p>
<pre><code class="prism language-cpp">static bool ParseUrl(const std::string&amp; file_path, std::string* url)
{
    // Fixed prefix of every documentation page on the official site
    std::string url_head = "https://www.boost.org/doc/libs/1_87_0/doc/html";
    // Drop the local prefix (src_path), keeping e.g. "/move.html"
    std::string url_tail = file_path.substr(src_path.size());
    *url = url_head + url_tail;
    return true;
}</code></pre>
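<p>The prefix swap can be tried in isolation with a small sketch. <code>BuildUrl</code> is a hypothetical stand-alone variant that takes both prefixes as parameters instead of reading the global <code>src_path</code>, and adds a guard that the article's version does not have.</p>

```cpp
#include <string>

// Hypothetical stand-alone variant of ParseUrl: replaces the local prefix
// src_path with url_head. Returns an empty string if file_path does not
// start with src_path (this guard is an addition, not in the article's code).
std::string BuildUrl(const std::string& url_head,
                     const std::string& src_path,
                     const std::string& file_path)
{
    if (file_path.compare(0, src_path.size(), src_path) != 0)
    {
        return "";
    }
    return url_head + file_path.substr(src_path.size());
}
```

<p>With <code>url_head</code> set to <code>https://www.boost.org/doc/libs/1_87_0/doc/html</code> and <code>src_path</code> set to <code>data/input</code>, the path <code>data/input/move.html</code> maps to the page address shown above.</p>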
<h3>3.3 Saving the <code>html</code> file contents</h3>
<p>The only thing to watch when saving is to insert <code>\3</code> as the separator between the title, content, and url. Each document ends with <code>\n</code>.</p>
<pre><code class="prism language-cpp">bool SaveHtml(const std::vector&lt;DocInfo_t&gt;&amp; results, const std::string&amp; output)
{
#define SEP '\3'
    // Write in binary mode
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if (!out.is_open())
    {
        LOG(FATAL, "open %s failed!", output.c_str());
        return false;
    }
    // Write one line per document: title \3 content \3 url \n
    for (auto&amp; item : results)
    {
        std::string out_string;
        out_string = item.title;
        out_string += SEP;
        out_string += item.content;
        out_string += SEP;
        out_string += item.url;
        out_string += '\n';
        out.write(out_string.c_str(), out_string.size());
    }
    out.close();
    return true;
}</code></pre>
<p><img src="https://i-blog.csdnimg.cn/img_convert/cd3dd88b2afea52971762bd4bd768008.png" alt="Pasted image 20250217234421" /><br /> As the screenshot shows, <code>raw.txt</code> now has content.</p>
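<p>Reading the file back means undoing this format: one line per document, split on <code>\3</code>. The sketch below shows one way to split a single line using only <code>std::string::find</code>. The name <code>SplitRecord</code> is hypothetical and this is not necessarily how the later index code does it.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one line of raw.txt into its fields
// (title, content, url) on the separator byte, '\3' by default.
std::vector<std::string> SplitRecord(const std::string& line, char sep = '\3')
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the separator
    }
    return fields;
}
```

<p>A well-formed line gives exactly three fields; any other count is a sign that the line is damaged and should be skipped.</p>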
<hr />
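<p>Next we write the search part of the engine, which involves index building, retrieval and ranking, and natural language processing.</p>
<h2>4. Building the index</h2>
<p>The first thing to consider is how to build the index.<br /> Before building it, we need to cover what a forward index and an inverted index are.</p>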
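<h3>4.1 Background: forward index</h3>
<p>A forward index organizes data by document ID; each document records the keywords it contains and related information.<br /> Structure:</p>
<pre><code class="prism language-txt">document ID -&gt; list of keywords</code></pre>
<p>The example and explanation below are quoted from 情月's post on 博客园 (cnblogs), "正排索引和倒排索引简单介绍" (a brief introduction to forward and inverted indexes).<br /> A fragment of page A:</p>
<blockquote>
<p>Tom is a boy.<br /> Tom is a student too.</p>
</blockquote>
<p>A fragment of page B:</p>
<blockquote>
<p>Jon work at school.<br /> Tom’s teacher is Jon.</p>
</blockquote>
<p>A forward index records how many times each keyword occurs. A lookup scans the word information of every document in the table until it finds the documents that contain the query keyword.<br /> Suppose the local document ID of page A is TA and that of page B is TB. The forward-index table then looks like this:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/2888349e47f3de88e276cfc5db2a8274.png" alt="Pasted image 20250218134736" /></p>
<p>As the description above shows, a forward index is keyed by <code>doc_id</code>, but searches are almost always made by keyword. Suppose we search for the keyword Tom, and 10 out of 100 pages contain it. Because the forward index is keyed by doc id, we have to scan all 100 pages to find those 10, and only then rank and sort them. That is inefficient, and with the number of pages on the web now far beyond the hundreds of millions, a forward index alone is not a suitable basis for search.<br /> On the other hand, a forward index is easy to maintain. Since it is keyed by doc, adding a page only means appending a new key at the end; once the words, their frequencies, and their positions have been analyzed, the entry is ready to use.</p>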
<hr />
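<p><strong>Advantages</strong>:</p>
<ul>
<li>Well suited to <strong>looking up content by document</strong>, for example getting all the keywords of one document.</li>
<li><strong>Simple storage</strong> that is easy to maintain.</li>
</ul>
<p><strong>Disadvantages</strong>:</p>
<ul>
<li><strong>Slow queries</strong>: finding every document that contains a keyword means going through all documents, which takes a long time.</li>
<li>Not suited to <strong>full-text search</strong>, because finding which documents a keyword appears in is inefficient.</li>
</ul>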
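<p>The cost described above can be made concrete with a small sketch. It assumes a forward index stored as a plain <code>std::vector</code> whose position is the document id; the names <code>Doc</code> and <code>ScanForKeyword</code> are hypothetical, and a substring match stands in for real keyword matching.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical document record; the vector position is the document id.
struct Doc
{
    std::string text;
};

// Keyword lookup on a forward index alone: every document must be scanned.
// Returns the ids of all documents whose text contains keyword as a substring.
std::vector<std::size_t> ScanForKeyword(const std::vector<Doc>& forward_index,
                                        const std::string& keyword)
{
    std::vector<std::size_t> hits;
    for (std::size_t id = 0; id < forward_index.size(); ++id)
    {
        if (forward_index[id].text.find(keyword) != std::string::npos)
        {
            hits.push_back(id);
        }
    }
    return hits;
}
```

<p>Fetching a document by id is a single index operation, but every keyword query walks the whole vector, which is the inefficiency the list above points out.</p>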
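<h3>4.2 Background: inverted index</h3>
<p>An inverted index organizes data by keyword; each keyword maps to the list of IDs of the documents that contain it.<br /> Structure:</p>
<pre><code class="prism language-txt">keyword -&gt; [list of document IDs]</code></pre>
<p>Suppose:</p>
<blockquote>
<p>我的手机是苹果 (my phone is an Apple) comes from document 1<br /> 你的手机是华为 (your phone is a Huawei) comes from document 2<br /> iPhone是手机 (an iPhone is a phone) comes from document 3</p>
</blockquote>
<p>Then the index contains the following entries:</p>
<pre><code class="prism language-txt">"手机" -&gt; [document 1, document 2, document 3]
"苹果" -&gt; [document 1]
"华为" -&gt; [document 2]
"iPhone" -&gt; [document 3]</code></pre>
<p><strong>Advantages</strong>:</p>
<ul>
<li><strong>Fast queries</strong>: it quickly finds every document that contains a keyword, which suits <strong>search engines and full-text search</strong>.</li>
<li><strong>Suited to large data sets</strong>; it is the core data structure of a search engine.</li>
</ul>
<p><strong>Disadvantages</strong>:</p>
<ul>
<li><strong>Costly to build</strong>: the text has to be preprocessed and the reverse mapping constructed.</li>
<li><strong>Needs extra storage</strong> for the mapping from keywords to document IDs.</li>
</ul>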
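<p>A minimal sketch of building such a mapping is shown below. It assumes the documents are already split into words by whitespace (the Chinese example above would need a real word segmenter first), and the names <code>PostingList</code>, <code>InvertedIndex</code>, and <code>BuildInverted</code> are hypothetical.</p>

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using PostingList = std::vector<std::uint64_t>;                       // ids of documents containing a word
using InvertedIndex = std::unordered_map<std::string, PostingList>;   // keyword -> posting list

// Builds keyword -> [document ids] from documents whose words are separated
// by whitespace. The position of a document in docs is used as its id.
InvertedIndex BuildInverted(const std::vector<std::string>& docs)
{
    InvertedIndex index;
    for (std::uint64_t id = 0; id < docs.size(); ++id)
    {
        std::istringstream words(docs[id]);
        std::string word;
        while (words >> word)
        {
            PostingList& list = index[word];
            // Ids are visited in increasing order, so checking the last
            // entry is enough to record each document only once per word.
            if (list.empty() || list.back() != id)
            {
                list.push_back(id);
            }
        }
    }
    return index;
}
```

<p>A query is then a single hash lookup, for example <code>index.find("phone")</code>, instead of a scan over every document.</p>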
<hr />
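<table>
<thead>
<tr>
<th>Index type</th>
<th>Use case</th>
<th>Strengths</th>
<th>Weaknesses</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Forward index</strong></td>
<td>Getting the keywords and metadata of a given document</td>
<td>Simple to build, suited to small data sets</td>
<td>Finding the documents that contain a keyword is slow</td>
</tr>
<tr>
<td><strong>Inverted index</strong></td>
<td>Search engines, full-text search</td>
<td>Fast queries, suited to quick retrieval</td>
<td>Slow to build, needs extra storage</td>
</tr>
</tbody>
</table>
<p>Search engines usually use the forward index and the inverted index together:</p>
<ol>
<li>Build the inverted index: used to quickly find which documents a keyword appears in.</li>
<li>Use the forward index: once the documents are retrieved, look up their remaining information in the forward index.</li>
</ol>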
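<p>The two steps can be sketched together as follows. The types <code>Page</code>, <code>Forward</code>, and <code>Inverted</code> and the function <code>Lookup</code> are assumptions for illustration only, and ranking is left out.</p>

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record stored in the forward index.
struct Page
{
    std::string title;
    std::string url;
};

using Forward = std::vector<Page>; // document id -> page
using Inverted = std::unordered_map<std::string, std::vector<std::size_t>>; // keyword -> ids

// Step 1: the inverted index gives the ids of the matching documents.
// Step 2: the forward index turns each id into the full record.
std::vector<Page> Lookup(const Inverted& inverted, const Forward& forward,
                         const std::string& keyword)
{
    std::vector<Page> results;
    auto it = inverted.find(keyword);
    if (it == inverted.end())
    {
        return results; // no document contains the keyword
    }
    for (std::size_t id : it->second)
    {
        if (id < forward.size()) // ignore ids that have no forward entry
        {
            results.push_back(forward[id]);
        }
    }
    return results;
}
```

<p>The inverted index only needs to store ids; titles, urls, and content live once in the forward index.</p>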
<hr />
<h3>4.3 Creating the <code>index</code> class</h3>
<p>Now that we know what forward and inverted indexes are, how do we write the code?<br /> We need an <code>index</code> class that can build both the forward index and the inverted index.<br /> The forward index uses a position to identify each stored document, while the inverted index uses a keyword to find the documents in which that keyword appears.<br /> We can store the documents in a <code>vector</code> as the forward index, and to describe a document we again need a struct, <code>DocInfo</code>.<br /> For the inverted index we can use the mapping provided by a hash table:</p>
<pre><code class="prism language-txt">KEY         VALUE
"手机"   -&gt; [document 1, document 2, document 3]
"苹果"   -&gt; [document 1]
"华为"   -&gt; [document 2]
"iPhone" -&gt; [document 3]</code></pre>
<p>The keyword is the key and a group of documents is the value.<br /> Because there are many documents, we again need a container to hold them, plus a struct that describes each one. This struct differs from the one above: it does not have to describe the whole document, because in the end we fetch the data from the forward index by <code>id</code>. It therefore only needs <code>the document id, the word, and the weight</code>.<br /> The resulting forward structure is <code>std::vector&lt;DocInfo&gt; forward_index</code><br /> and the inverted structure is <code>std::unordered_map&lt;std::string, InvertedList&gt; Inverted_index</code><br /> Note: the <code>index</code> class is designed as a singleton. Why?<br /> To avoid loading the index more than once and to save memory.<br /> Below is the basic structure of the <code>index</code> class; everything that follows fills in this structure:</p>
<pre><code class="prism language-cpp">#pragma once
#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
#include &lt;fstream&gt;
#include &lt;unordered_map&gt;
#include "util.hpp"
#include "Log.hpp"
#include &lt;mutex&gt;

namespace ns_index
{
    struct DocInfo
    {
        std::string title;
        std::string content;
        std::string url;
        uint64_t doc_id; // id of the document
    };

    struct InvertedElem
    {
        uint64_t doc_id;
        std::string word;
        int weight;
        InvertedElem() : weight(0) {}
    };

    // Posting list of one keyword
    typedef std::vector&lt;InvertedElem&gt; InvertedList;

    class Index
    {
    private:
        // The forward index is a plain array; the array position is a natural document id
        std::vector&lt;DocInfo&gt; forward_index; // forward index
        // The inverted index maps one keyword to a group of InvertedElem (keyword -&gt; posting list)
        std::unordered_map&lt;std::string, InvertedList&gt; Inverted_index;

    private: // singleton
        Index() {}
        Index(const Index&amp;) = delete;
        Index&amp; operator=(const Index&amp;) = delete;

    public:
        ~Index() {}

    public:
        static Index* GetInstance()
        {
            if (nullptr == instance) // first check, before taking the lock
            {
                mtx.lock();
                if (nullptr == instance) // second check, under the lock
                {
                    instance = new Index();
                }
                mtx.unlock();
            }
            return instance;
        }

    public:
        // Build the forward and inverted indexes from the tag-stripped, formatted documents
        bool BuildIndex(const std::string&amp; input)
        {
            return false; // placeholder, the body is written in 4.3.1
        }

    private:
        static Index* instance;
        static std::mutex mtx;
    };

    Index* Index::instance = nullptr;
    std::mutex Index::mtx;
}</code></pre>
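<p>As an aside, the first check in <code>GetInstance</code> reads a plain pointer outside the lock, which the C++11 memory model does not strictly guarantee to be safe without atomics. A common alternative is a function-local static, whose initialization C++11 guarantees to be thread-safe. The sketch below uses a hypothetical stand-in class, <code>IndexSketch</code>, and is not the article's code.</p>

```cpp
// Hypothetical stand-in for ns_index::Index, showing only the singleton part.
class IndexSketch
{
public:
    // The function-local static is created on first use; since C++11 its
    // initialization is thread-safe, so no mutex or null check is needed.
    static IndexSketch* GetInstance()
    {
        static IndexSketch instance;
        return &instance;
    }

    IndexSketch(const IndexSketch&) = delete;
    IndexSketch& operator=(const IndexSketch&) = delete;

private:
    IndexSketch() {}
};
```

<p>Callers use it the same way, for example <code>IndexSketch::GetInstance()</code>, and always get the same object.</p>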
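<h4>4.3.1 Implementing the <code>BuildIndex</code> function</h4>
<p>We need to load the data that the parser produced, which is stored at <code>/home/yui/boost_searcher/data/raw_html/raw.txt</code>.<br /> Once the data is loaded, we build the indexes: the forward index first, then the inverted index.</p>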
<pre><code class="prism language-cpp">bool BuildIndex(const std::string&amp; input)
{
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open())
    {
        LOG(FATAL, "input %s failed", input.c_str());
        return false;
    }
    std::string line;
    int count = 0;
    while (std::getline(in, line))
    {
        // One line is one document: add it to the forward index first
        DocInfo* doc = BuildForwardIndex(line);
        if (nullptr == doc)
        {
            LOG(WARNING, "build %s failed", line.c_str());
            continue;
        }
        // Then add its words to the inverted index
        BuildInvertedIndex(*doc);
        count++;
        if (count % 100 == 0)
        {
            LOG(DEBUG, "create index success num:%d", count);
        }
    }
    return true;
}</code></pre>
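<p>The forward index and the inverted index are implemented in detail below.</p>
<h4>4.3.2 Implementing the <code>BuildForwardIndex</code> function</h4>
<p>Building the forward index is fairly simple, certainly much simpler than the inverted index.<br /> To do it, we need to split a string. Recall the separator <code>\3</code> that was inserted into each document record earlier: it is the key to cutting a record into its <code>title</code>, <code>content</code>, and <code>url</code>.<br /> The <code>boost</code> library can do the split in a single call, so we wrap that call in the utility class.</p>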
<pre><code class="prism language-cpp">// Remember to add the header: #include &lt;boost/algorithm/string.hpp&gt;
class StringUtil
{
public:
    static void Split(const std::string&amp; target, std::vector&lt;std::string&gt;* out, const std::string&amp; sep)
    {
        // Delegate to Boost; token_compress_on merges adjacent separators
        boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
    }
};
</code></pre>
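<p>A brief introduction to <code>boost::split</code>.<br /> The explanation below is quoted from GPT, and it reads well:<br /> Function prototype (simplified; the real <code>boost::split</code> is a function template):</p>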
<pre><code class="prism language-cpp">#include &lt;boost/algorithm/string.hpp&gt;
#include &lt;vector&gt;
#include &lt;string&gt;

void split(std::vector&lt;std::string&gt;&amp; result,
           const std::string&amp; input,
           const boost::algorithm::detail::is_any_ofF&lt;char&gt;&amp; separator,
           boost::algorithm::token_compress_mode_type compress = boost::algorithm::token_compress_off);</code></pre>
<p>Parameters:</p>
<ul>
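<li><code>result</code>: the <code>std::vector</code> that receives the resulting substrings.</li>
<li><code>input</code>: the string to split.</li>
<li><code>separator</code>: the separator predicate. With <code>boost::is_any_of</code> it is a set of one or more characters, and each character in the set acts as a separator on its own.</li>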
<li><code>compress</code>:
<ul>
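<li><code>boost::algorithm::token_compress_on</code>: merges runs of adjacent separators into one, so no empty strings are produced between them.</li>
<li><code>boost::algorithm::token_compress_off</code> (the default): separators are not merged, so adjacent separators produce empty strings.</li>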
</ul>
</li>
</ul>
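<p>The difference between the two compress modes is easiest to see on a small input. The block below is a standard-library-only sketch that mimics the behavior described in the list above; <code>SplitAnyOf</code> is a hypothetical helper written for illustration, not a Boost function and not Boost's implementation.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for boost::split combined with boost::is_any_of.
// `compress == true` mimics token_compress_on, `false` mimics token_compress_off.
std::vector<std::string> SplitAnyOf(const std::string& input,
                                    const std::string& separators,
                                    bool compress) {
    assert(!separators.empty());
    std::vector<std::string> tokens;
    std::string current;
    std::size_t i = 0;
    while (i < input.size()) {
        if (separators.find(input[i]) != std::string::npos) {
            // A separator ends the current token.
            tokens.push_back(current);
            current.clear();
            ++i;
            // In compress mode a run of adjacent separators counts as one.
            while (compress && i < input.size() &&
                   separators.find(input[i]) != std::string::npos) {
                ++i;
            }
        } else {
            current.push_back(input[i]);
            ++i;
        }
    }
    tokens.push_back(current);  // the last token, possibly empty
    return tokens;
}
```

<p>With this sketch, <code>SplitAnyOf("a\3\3b", "\3", false)</code> yields three tokens (the middle one empty), while the compressing variant yields two. For the index this matters: with compression on, a record whose <code>content</code> field is empty splits into two fields instead of three, so the <code>results.size() != 3</code> check in <code>BuildForwardIndex</code> rejects it.</p>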
<hr />
<p>Once the string is split, the fields can be picked out:</p>
<pre><code class="prism language-cpp">DocInfo* BuildForwardIndex(const std::string&amp; line)
{
    // 1. Parse the line by splitting it
    // The line splits into 3 strings: title, content, url
    std::vector&lt;std::string&gt; results;
    const std::string sep = "\3"; // field separator within a line
    ns_util::StringUtil::Split(line, &amp;results, sep);
    if (results.size() != 3)
    {
        return nullptr;
    }
    // 2. Fill a DocInfo with the strings
    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    doc.doc_id = forward_index.size(); // set the id before inserting, so it equals this doc's index in the vector
    // 3. Append to the forward index vector
    forward_index.push_back(std::move(doc));
    return &amp;forward_index.back(); // the document that was just added
}</code></pre>
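<p>One detail of the function above deserves a note: a pointer into a <code>std::vector</code> stays valid only until the vector reallocates, so the returned <code>DocInfo*</code> should be used right away (as <code>BuildIndex</code> does) and not stored. A minimal sketch of an alternative that hands out ids instead of pointers is shown below; the names <code>Doc</code> and <code>ForwardIndex</code> are hypothetical simplifications, not the article's classes.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified version of the article's DocInfo.
struct Doc {
    std::string title;
    std::string content;
    std::string url;
    std::uint64_t doc_id = 0;
};

class ForwardIndex {
public:
    // Stores the document and returns its id, which equals its position
    // in the vector. An id stays valid even if the vector reallocates.
    std::uint64_t Add(Doc doc) {
        doc.doc_id = docs_.size();
        docs_.push_back(std::move(doc));
        assert(docs_.back().doc_id == docs_.size() - 1);  // id == index invariant
        return docs_.back().doc_id;
    }

    // Looks a document up by id; returns nullptr for an unknown id.
    const Doc* Get(std::uint64_t doc_id) const {
        if (doc_id >= docs_.size()) return nullptr;
        return &docs_[doc_id];
    }

private:
    std::vector<Doc> docs_;
};
```

<p>Because the id is the vector index, a lookup by id is a constant-time array access, which is exactly what a forward index is for.</p>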
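<h4>4.3.3 Implementing the <code>BuildInvertedIndex</code> function</h4>
<p>To build the inverted index we must first segment the text into words (covered later; the utility class will gain a segmentation helper based on the jieba library).<br /> The variable <code>doc</code> already holds the document's <code>title</code> and <code>content</code>. We first define a struct <code>word_cnt</code> that counts how often each word appears in the <code>title</code> and in the <code>content</code> after segmentation, which makes the later weight calculation easy.<br /> In this function we pass the <code>title</code> and the <code>content</code> to the segmenter separately. Once the words come back, we count their frequencies, fill in an <code>InvertedElem</code> for each word, and store it in the <code>inverted_index</code> member.</p>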
<pre><code class="prism language-cpp">bool BuildInvertedIndex(const DocInfo &amp;doc)
{
    struct word_cnt
    {
        int title_cnt;
        int content_cnt;
        word_cnt() : title_cnt(0), content_cnt(0) {}
    };
    std::unordered_map&lt;std::string, word_cnt&gt; word_map; // temporary word -&gt; counts map
    // Segment the title
    std::vector&lt;std::string&gt; title_words;
    ns_util::JiebaUtil::CutString(doc.title, &amp;title_words);
    // Count word frequencies in the title
    for (std::string s : title_words)
    {
        boost::to_lower(s); // convert to lowercase
        word_map[s].title_cnt++;
    }
    // Segment the content
    std::vector&lt;std::string&gt; content_words;
    ns_util::JiebaUtil::CutString(doc.content, &amp;content_words);
    // Count word frequencies in the content
    for (std::string s : content_words)
    {
        boost::to_lower(s);
        word_map[s].content_cnt++;
    }
// X: weight of a hit in the title, Y: weight of a hit in the content
#define X 10
#define Y 1
    for (auto &amp;word_pair : word_map)
    {
        InvertedElem item;
        item.doc_id = doc.doc_id;
        item.word = word_pair.first;
        item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt;
        InvertedList &amp;inverted_list = inverted_index[word_pair.first]; // operator[] creates the key if it is missing
        inverted_list.push_back(std::move(item));
    }
    return true;
}</code></pre>
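<p>To make the weight formula concrete, here is a small standard-library-only sketch. It assumes the same weights as above (10 per title hit, 1 per content hit) and, purely for illustration, splits on whitespace instead of calling cppjieba; <code>CountWords</code> and <code>ComputeWeights</code> are hypothetical helper names.</p>

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Assumed weights, matching the article's X = 10 and Y = 1.
constexpr int kTitleWeight = 10;
constexpr int kContentWeight = 1;

// Splits on whitespace and lowercases each token. This is only a stand-in
// for cppjieba's segmentation, which the article actually uses.
static void CountWords(const std::string& text,
                       std::unordered_map<std::string, int>* counts) {
    assert(counts != nullptr);
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        ++(*counts)[word];
    }
}

// Returns word -> weight for a single document.
std::unordered_map<std::string, int> ComputeWeights(const std::string& title,
                                                    const std::string& content) {
    std::unordered_map<std::string, int> title_counts, content_counts, weights;
    CountWords(title, &title_counts);
    CountWords(content, &content_counts);
    for (const auto& entry : title_counts) {
        weights[entry.first] += kTitleWeight * entry.second;
    }
    for (const auto& entry : content_counts) {
        weights[entry.first] += kContentWeight * entry.second;
    }
    return weights;
}
```

<p>For the title <code>"Boost Filesystem"</code> and the content <code>"boost path boost"</code>, the word <code>boost</code> gets 10 × 1 + 1 × 2 = 12, <code>filesystem</code> gets 10, and <code>path</code> gets 1. Lowercasing before counting is what lets <code>Boost</code> and <code>boost</code> share one entry.</p>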
<h4>4.3.4 Background: the cppjieba library</h4>
<p>The word segmentation used in this article comes from the <code>cppjieba</code> library.<br /> Open GitHub:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/c9bac59dd2893c34f27866fee1c6d1d5.png" alt="Pasted image 20250219161345" /><br /> It is the repository with the most stars.<br /> Open it, and you can clone it:</p>
<pre><code class="prism language-shell"><span class="token function">git</span> clone https://github.com/yanyiwu/cppjieba.git</code></pre>
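<p>Running this command in a terminal fetches the latest version of the jieba library.<br /> I don't actually recommend the latest version. When I tried it last night I could not get it to work, possibly because of something in my machine's setup, and I could not get the demo running by following the steps below either.<br /> In the end I checked out an older version, one from 2022.<br /> If you don't know how to roll back, run the clone command above, enter the <code>cppjieba</code> directory, and run:</p>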
<pre><code class="prism language-shell"><span class="token function">git</span> checkout e81930b7c2266ce0b1a820f2bbad55098859f833</code></pre>
<p>Now your version matches mine.<br /> Next I'll show how to get <code>demo.cpp</code> running.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/50e47ab943c4115498ab589be3677e4e.png" alt="Pasted image 20250219162541" /><br /> First run these three commands:</p>
<pre><code class="prism language-shell">cp cppjieba/test/demo.cpp ./   # copy the demo file out for testing
ln -s cppjieba/dict dict       # the dictionaries live here; segmentation does not work without them
ln -s cppjieba/include inc     # where the header files live</code></pre>
<p>After that, the demo file needs small changes too, since it now lives in a different directory.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/66feffd64d3aeac919bac1ddab2334c3.png" alt="Pasted image 20250219163121" /><br /> There are two main changes, marked by the arrows.<br /> That still isn't enough: we also need to copy <code>cppjieba/deps/limonp</code> into <code>cppjieba/include/cppjieba</code>.<br /> The result after copying:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/ee16497e3e28d8ab8e0859bfb15dbb79.png" alt="Pasted image 20250219163406" /><br /> This is really a small bug in the jieba library.<br /> Now we can compile the demo file:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/9bb2bfc560f4e6f0160ea51ae2286c76.png" alt="Pasted image 20250219163508" /><br /> If you see output like this, it worked.<br /> What remains is to bring the jieba library into our working directory.<br /> Our <code>cppjieba</code> library is stored in <code>~/mylib</code>,<br /> so it can be linked in like this:</p>
<pre><code class="prism language-shell">ln -s ~/mylib/cppjieba/include/cppjieba cppjieba
ln -s ~/mylib/cppjieba/dict dict</code></pre>
<p>The symlinks are now set up in our working directory.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/5ccd3ec554d1b0cfb345d29fb58df0fc.png" alt="Pasted image 20250219164041" /></p>
<h4>4.3.5 Extending the utility class: word segmentation</h4>
<p>The inverted index above relies on the utility class, so the utility class needs a segmentation feature as well.<br /> First, include the headers:</p>
<pre><code class="prism language-cpp">#include "cppjieba/Jieba.hpp"
#include &lt;mutex&gt;</code></pre>
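<p>The mutex is included in order to implement the singleton pattern; yes, I plan to make the segmentation class a singleton too.<br /> The function that does the real work in this class is <code>jieba.CutForSearch(src, *out);</code>: give it the string <code>src</code>, and it segments it and stores the words in <code>out</code>. Calling a library really is that simple.<br /> That alone would be too easy, though, so this class also removes stop words.<br /> Some readers will not know what stop words are.</p>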
<blockquote>
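<p>Stop words are characters or words that are automatically filtered out before or after processing natural-language data (text) in information retrieval, in order to save storage space and improve search efficiency. These stop words are entered manually rather than generated automatically, and together they form a stop word list. However, no single stop word list suits every tool, and some tools deliberately avoid stop word removal in order to support phrase search.</p>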
</blockquote>
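<p>That is the explanation from Baidu Baike.</p>
<p>An example: in "because xxx, therefore xxx", the words "because" and "therefore" are stop words; they do not help a search.<br /> In English, words such as "the" and "is" are stop words too.<br /> Open <code>stop_words.utf8</code> in the jieba library; everything in it is a stop word.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/babfec529a19456aed8336ca2e715c4e.png" alt="Pasted image 20250219170646" /><br /> With this list we can remove the stop words from the segmented output in <code>out</code>; a single loop is enough.<br /> The implementation follows:</p>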
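<p>Taken on its own, independent of the article's class, the filtering step can be sketched with the standard library only. The helper name <code>RemoveStopWords</code> is hypothetical, and the sketch uses a <code>std::unordered_set</code> with the erase-remove idiom, which is one possible choice and not necessarily what the article's code does.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Removes every token that appears in `stop_words`, keeping the order of the
// remaining tokens. Uses the erase-remove idiom instead of a manual loop.
void RemoveStopWords(const std::unordered_set<std::string>& stop_words,
                     std::vector<std::string>* tokens) {
    assert(tokens != nullptr);
    tokens->erase(
        std::remove_if(tokens->begin(), tokens->end(),
                       [&stop_words](const std::string& token) {
                           return stop_words.count(token) > 0;
                       }),
        tokens->end());
}
```

<p>The erase-remove idiom moves the kept tokens to the front in one pass and then trims the tail, so it avoids the iterator bookkeeping that erasing inside a hand-written loop requires.</p>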
<pre><code class="prism language-cpp"><span class="token comment">//首先先引入词库</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">\"./dict/jieba.dict.utf8\"</span><span class="token punctuation">;</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">\"./dict/hmm_model.utf8\"</span><span class="token punctuation">;</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">\"./dict/user.dict.utf8\"</span><span class="token punctuation">;</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">\"./dict/idf.utf8\"</span><span class="token punctuation">;</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">\"./dict/stop_words.utf8\"</span><span class="token punctuation">;</span><span class="token keyword">class</span> <span class="token class-name">JiebaUtil</span><span class="token punctuation">{<!-- --></span><span class="token keyword">private</span><span class="token operator">:</span> cppjieba<span class="token double-colon punctuation">::</span>Jieba jieba<span class="token punctuation">;</span> std<span class="token double-colon 
punctuation">::</span>unordered_map&lt;std::string, bool&gt; stop_words; // used to test whether a token is a stop word

private:
    JiebaUtil() : jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}
    JiebaUtil(const JiebaUtil &amp;) = delete;

public:
    static JiebaUtil *get_instance()
    {
        if (nullptr == instance)
        {
            mtx.lock();
            if (nullptr == instance)
            {
                instance = new JiebaUtil();
                instance-&gt;InitJiebaUtil();
            }
            mtx.unlock();
        }
        return instance;
    }

    // Load the stop-word file, one word per line
    void InitJiebaUtil()
    {
        std::ifstream in(STOP_WORD_PATH);
        if (!in.is_open())
        {
            LOG(FATAL, "load stop words file error");
            return;
        }
        std::string line;
        while (std::getline(in, line))
        {
            stop_words.insert({line, true});
        }
        in.close();
    }

    void CutStringHelper(const std::string &amp;src, std::vector&lt;std::string&gt; *out)
    {
        jieba.CutForSearch(src, *out);
        for (auto iter = out-&gt;begin(); iter != out-&gt;end();)
        {
            auto it = stop_words.find(*iter);
            if (it != stop_words.end())
            {
                // The current token is a stop word, so remove it
                iter = out-&gt;erase(iter);
            }
            else
            {
                iter++;
            }
        }
    }

public:
    static void CutString(const std::string &amp;src, std::vector&lt;std::string&gt; *out)
    {
        ns_util::JiebaUtil::get_instance()-&gt;CutStringHelper(src, out);
    }

private:
    static JiebaUtil *instance;
    static std::mutex mtx;
};
JiebaUtil *JiebaUtil::instance = nullptr;
std::mutex JiebaUtil::mtx;</code></pre>
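<p>The stop-word filter in <code>CutStringHelper</code> erases matching tokens one at a time, and every <code>erase</code> shifts the rest of the vector. As a minimal sketch of an alternative (not the article's code), the erase-remove idiom does the same filtering in a single pass. The helper name <code>RemoveStopWords</code> and the use of <code>std::unordered_set</code> in place of the <code>unordered_map&lt;std::string, bool&gt;</code> above are assumptions made for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the article's JiebaUtil):
// drop every token that appears in stop_words.
// std::remove_if moves the kept tokens to the front in one pass,
// then erase() trims the leftover tail.
inline void RemoveStopWords(const std::unordered_set<std::string>& stop_words,
                            std::vector<std::string>* tokens) {
    assert(tokens != nullptr);  // precondition: caller passes a valid vector
    tokens->erase(
        std::remove_if(tokens->begin(), tokens->end(),
                       [&stop_words](const std::string& t) {
                           return stop_words.count(t) > 0;
                       }),
        tokens->end());
}
```

<p>The relative order of the remaining tokens is preserved, so the result matches what the loop version produces.</p>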
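<p>With that, the utility class is complete.</p>
<h4>4.3.6 Completing the Index class</h4>
<p>Our <code>Index</code> class is not quite finished: it still cannot look up a document by its <code>doc_id</code>, or return the inverted list for a given word.<br /> Both lookups are essential. Without them the forward and inverted indexes can be built but never queried.<br /> Both are also easy to implement, because each one is a single lookup.<br /> Finding a document by <code>doc_id</code>:</p>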
<pre><code class="prism language-cpp">DocInfo* GetForwardIndex(uint64_t doc_id)
{
    if (doc_id &gt;= forward_index.size())
    {
        LOG(ERROR, "doc_id out range,error!");
        return nullptr;
    }
    return &amp;forward_index[doc_id];
}</code></pre>
<p>Finding the inverted list for a keyword:</p>
<pre><code class="prism language-cpp">InvertedList* GetInvertedList(const std::string&amp; word)
{
    auto iter = inverted_index.find(word);
    if (iter == inverted_index.end())
    {
        LOG(WARNING, "%s have no InvertedList", word.c_str());
        return nullptr;
    }
    return &amp;(iter-&gt;second);
}</code></pre>
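<p>To see the two lookups working together, here is a small self-contained sketch. It is not the article's <code>Index</code> class: <code>MiniIndex</code>, <code>MiniDoc</code>, <code>MiniPosting</code> and <code>AddDoc</code> are hypothetical names, and every word gets a fixed weight of 1. It only illustrates the idea that the forward index is a vector subscripted by <code>doc_id</code> and the inverted index is a hash map from word to posting list.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the article's DocInfo / InvertedElem.
struct MiniDoc { std::string title; std::uint64_t doc_id; };
struct MiniPosting { std::uint64_t doc_id; int weight; };

class MiniIndex {
public:
    // Store a document and register each of its words in the inverted index.
    std::uint64_t AddDoc(const std::string& title,
                         const std::vector<std::string>& words) {
        std::uint64_t id = forward_.size();
        forward_.push_back({title, id});
        assert(forward_.size() == id + 1);  // doc_id is the vector subscript
        for (const auto& w : words) inverted_[w].push_back({id, 1});
        return id;
    }
    // Forward lookup: doc_id -> document, nullptr when out of range.
    const MiniDoc* GetForwardIndex(std::uint64_t doc_id) const {
        if (doc_id >= forward_.size()) return nullptr;
        return &forward_[doc_id];
    }
    // Inverted lookup: word -> posting list, nullptr when the word is unknown.
    const std::vector<MiniPosting>* GetInvertedList(const std::string& word) const {
        auto it = inverted_.find(word);
        return it == inverted_.end() ? nullptr : &it->second;
    }
private:
    std::vector<MiniDoc> forward_;
    std::unordered_map<std::string, std::vector<MiniPosting>> inverted_;
};
```

<p>Both lookups return raw pointers into the containers, so the pointers stay valid only while the index is not modified. That is one reason to finish building the index before serving any searches.</p>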
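<h2>5. Retrieval and ranking</h2>
<p>How should retrieval work?<br /> In everyday use we type a string into a search box and wait for the results.<br /> What does the search engine do while we wait? Assume for now that it has not even been started.<br /> That is the question to answer here.<br /> First it has to initialize itself, which means building the forward index and the inverted index.<br /> Only after initialization has finished can it serve searches.<br /> Next it receives the string we want to search for, <code>query</code>, and runs the search.<br /> Finally it presents the results.<br /> So there are two functions to implement:</p>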
<ol>
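<li>Initialization</li>
<li>Search<br /> Below is the basic structure of the <code>Searcher</code> class we are going to implement.<br /> Its member functions are the two listed above, and its data member is a pointer to the index class.</li>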
</ol>
<pre><code class="prism language-cpp">#pragma once
#include "index.hpp"
#include "util.hpp"
#include "Log.hpp"
#include &lt;algorithm&gt;
#include &lt;string&gt;
#include &lt;unordered_map&gt;
#include &lt;vector&gt;

namespace ns_searcher
{
    // One merged search hit: a document plus every query word that matched it
    struct InvertedElemPrint
    {
        uint64_t doc_id;
        int weight;
        std::vector&lt;std::string&gt; words;
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };

    class Searcher
    {
    private:
        ns_index::Index* index; // the index used for lookups
    public:
        Searcher() : index(nullptr) {}
        ~Searcher() {}
    public:
        void InitSearcher(const std::string&amp; input) {}
        void Search(const std::string&amp; query, std::string* json_string) {}
    };
}</code></pre>
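<h3>5.1 Initialization</h3>
<p>Initializing the <code>Searcher</code> class is simple and takes only two steps: get the <code>Index</code> singleton, then call <code>BuildIndex</code>, which builds both the forward index and the inverted index.</p>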
<pre><code class="prism language-cpp">void InitSearcher(const std::string&amp; input)
{
    // 1. Get (or create) the Index singleton
    index = ns_index::Index::GetInstance();
    LOG(INFO, "got the index singleton...");
    // 2. Build the indexes through the Index object
    index-&gt;BuildIndex(input);
    LOG(INFO, "built the forward and inverted indexes...");
}</code></pre>
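<h3>5.2 Implementing the <code>Search</code> function</h3>
<p>Searching means taking a query string and returning results. Along the way we tokenize the query, look each token up in the inverted index, and sort the hits by <code>weight</code> in descending order. The sorted results are then serialized, for which we can use the <code>Jsoncpp</code> library.<br /> In short, there are four steps:</p>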
<ol>
<li>Tokenize the query string.</li>
<li>Look up each token in the index.</li>
<li>Merge the hits and sort them by <code>weight</code> in descending order.</li>
<li>Serialize the results into a JSON string.<br /> Note also that a search result normally shows only an excerpt of the document, not its full content.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/570b98725b5be8f0020301f67020ff5a.png" alt="Pasted image 20250220141056" /><br /> We need this as well, so we write a function that extracts an excerpt from <code>content</code>. That function is <code>GetDesc</code>.</li>
</ol>
<pre><code class="prism language-cpp">void Search(const std::string&amp; query, std::string* json_string)
{
    // 1. Tokenize the query string
    std::vector&lt;std::string&gt; words;
    ns_util::JiebaUtil::CutString(query, &amp;words);
    // 2. Look up each token in the index
    std::vector&lt;InvertedElemPrint&gt; inverted_list_all;
    std::unordered_map&lt;uint64_t, InvertedElemPrint&gt; tokens_map;
    for (std::string word : words)
    {
        boost::to_lower(word);
        ns_index::InvertedList* inverted_list = index-&gt;GetInvertedList(word);
        if (nullptr == inverted_list)
        {
            continue;
        }
        // Merge hits for the same document and add up their weights
        for (const auto&amp; elem : *inverted_list)
        {
            auto&amp; item = tokens_map[elem.doc_id];
            item.doc_id = elem.doc_id;
            item.weight += elem.weight;
            item.words.push_back(elem.word);
        }
    }
    // Iterate by non-const reference so std::move really moves
    for (auto&amp; item : tokens_map)
    {
        inverted_list_all.push_back(std::move(item.second));
    }
    // 3. Sort the merged hits by weight in descending order
    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const InvertedElemPrint&amp; e1, const InvertedElemPrint&amp; e2)
              {
                  return e1.weight &gt; e2.weight;
              });
    // 4. Serialize the results into a JSON string
    Json::Value root;
    for (auto&amp; item : inverted_list_all)
    {
        ns_index::DocInfo* doc = index-&gt;GetForwardIndex(item.doc_id);
        if (nullptr == doc)
        {
            continue;
        }
        Json::Value elem;
        elem["title"] = doc-&gt;title;
        elem["desc"] = GetDesc(doc-&gt;content, item.words[0]);
        elem["url"] = doc-&gt;url;
        elem["id"] = (int)item.doc_id;
        elem["weight"] = item.weight;
        root.append(elem);
    }
    Json::FastWriter write;
    *json_string = write.write(root);
}</code></pre>
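<p>Steps 2 and 3 are the core of <code>Search</code>: postings for the same document are merged and their weights added, then the merged hits are sorted. The following sketch isolates that logic so it can be tried without the index, cppjieba or Jsoncpp. The names <code>Posting</code>, <code>MergedHit</code> and <code>MergeAndRank</code> are hypothetical, and the tie-break on <code>doc_id</code> is an addition that the function above does not have.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for InvertedElem / InvertedElemPrint.
struct Posting { std::uint64_t doc_id; std::string word; int weight; };
struct MergedHit {
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merge the posting lists of all query tokens by doc_id, then rank by weight.
inline std::vector<MergedHit> MergeAndRank(
        const std::vector<std::vector<Posting>>& lists) {
    std::unordered_map<std::uint64_t, MergedHit> by_doc;
    for (const auto& list : lists) {
        for (const auto& p : list) {
            assert(p.weight >= 0);  // assumption: weights are non-negative
            MergedHit& hit = by_doc[p.doc_id];
            hit.doc_id = p.doc_id;
            hit.weight += p.weight;        // a document matching several tokens scores higher
            hit.words.push_back(p.word);
        }
    }
    std::vector<MergedHit> ranked;
    ranked.reserve(by_doc.size());
    for (auto& kv : by_doc) ranked.push_back(std::move(kv.second));
    std::sort(ranked.begin(), ranked.end(),
              [](const MergedHit& a, const MergedHit& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;
                  return a.doc_id < b.doc_id;  // tie-break for a stable result order
              });
    return ranked;
}
```

<p>Merging by <code>doc_id</code> before sorting is what keeps a document from appearing twice in the results when it matches more than one token of the query.</p>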
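<h4>5.2.1 Implementing the excerpt function <code>GetDesc</code></h4>
<p>We need an excerpt of the content. Here I take the text from 50 characters before the first occurrence of the keyword to 100 characters after it.<br /> First we find where the keyword first occurs. The standard algorithm <code>std::search</code> can do this directly. Because the match must ignore case, we also pass it a custom comparison function.<br /> Then we cut out the text around that position. The keyword may not have that many characters before or after it, so those cases need special handling.</p>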
<pre><code class="prism language-cpp">std::string GetDesc(const std::string&amp; html_content, const std::string&amp; word)
{
    // Size of the excerpt window around the keyword
    const int prev_step = 50;
    const int next_step = 100;
    // 1. Find the first occurrence, ignoring case.
    //    unsigned char avoids undefined behavior in std::tolower for negative char values.
    auto iter = std::search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y)
                            {
                                return (std::tolower(x) == std::tolower(y));
                            });
    if (iter == html_content.end())
    {
        return "None1";
    }
    int pos = std::distance(html_content.begin(), iter);
    // 2. Compute the start and end subscripts
    int start = 0;
    int end = html_content.size() - 1;
    // Move start/end only if there are enough characters before/after the keyword
    if (pos &gt; start + prev_step) start = pos - prev_step;
    if (pos &lt; end - next_step) end = pos + next_step;
    // 3. Cut out the substring
    if (start &gt;= end) return "None2";
    std::string desc = html_content.substr(start, end - start);
    desc += "...";
    return desc;
}</code></pre>
<h2>6. A Quick Debug</h2>
<p>Let's write a small <code>debug.cpp</code> to see how things work so far. After that we will put the engine online.</p>
<pre><code class="prism language-cpp">#include "searcher.hpp"
// The header names were lost in the original listing; these are the ones the code below needs.
#include &lt;cstdio&gt;
#include &lt;cstring&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;

const std::string input = "data/raw_html/raw.txt";

int main()
{
    // for test
    ns_searcher::Searcher *search = new ns_searcher::Searcher();
    search-&gt;InitSearcher(input);
    std::string query;
    std::string json_string;
    char buffer[1024];
    while (true)
    {
        std::cout &lt;&lt; "Please Enter Your Search Query# ";
        // Stop on EOF or a read error instead of reusing a stale buffer
        if (fgets(buffer, sizeof(buffer), stdin) == nullptr) break;
        // Strip the trailing newline, if there is one
        buffer[strcspn(buffer, "\n")] = 0;
        query = buffer;
        search-&gt;Search(query, &amp;json_string);
        std::cout &lt;&lt; json_string &lt;&lt; std::endl;
    }
    delete search;
    return 0;
}</code></pre>
<p><strong>Writing the makefile</strong></p>
<pre><code class="prism language-shell">.PHONY:all
all:parser debug

parser:parser.cpp
	g++ -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
debug:debug.cpp
	g++ -o $@ $^ -std=c++11 -ljsoncpp

.PHONY:clean
clean:
	rm -rf parser debug</code></pre>
<p>Running <code>make</code> produces two executables. Run <code>parser</code> first to generate the document data, then run <code>debug</code>.<br /> Note: this can take a while (I am on a low-spec cloud server).<br /> <img src="https://i-blog.csdnimg.cn/img_convert/a612256beae0fc5c8ed5b07a8ae53e7a.png" alt="Pasted image 20250220142555" /><br /> Roughly this many index entries are built, after which you can type whatever you want to search for. Here I searched for the keyword <code>split</code>.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/1a6a646becf4b633cc9cb57717f3b863.png" alt="Pasted image 20250220142713" /><br /> As you can see, the output is hard to read. That is because of the JSON writer we chose: <code>Json::FastWriter</code> packs everything onto a single line, like this:</p>
<pre><code class="prism language-txt">{"name":"Alice","age":25,"city":"New York"}</code></pre>
<p>So you can pick a different writer, such as <code>Json::StyledWriter</code>, whose output looks more like this:</p>
<pre><code class="prism language-txt">{
    "name": "Alice",
    "age": 25,
    "city": "New York"
}</code></pre>
<p>I will not change it here, though: the results end up on a web page, so the formatting in the terminal does not matter.</p>
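<h2>7. Putting the Search Engine Online</h2>
<p>However capable our search engine is right now, only we can use it. What if we want other people to access it too? For that we need to write a server. We could write one from scratch; knowing the basics of <code>socket</code> programming is enough, and in earlier posts we wrote a few simple chat servers. But an existing library is more convenient, and it has already been tested by many users, so it is less error-prone. Below we bring in the <code>cpp-httplib</code> library.</p>
<h3>7.1 Bringing in the <code>cpp-httplib</code> library</h3>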
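<p><img src="https://i-blog.csdnimg.cn/img_convert/6e9ae0d644fd21146d4d354cf751cfbc.png" alt="Pasted image 20250220144616" /><br /> Open GitHub and search for <code>cpp-httplib</code>.<br /> If you would rather not search, the repository is at https://github.com/yhirose/cpp-httplib<br /> As usual, run <code>git clone https://github.com/yhirose/cpp-httplib.git</code> in a terminal<br /> to download the files into the current directory.<br /> You can keep the library in your working directory, or do what I did.<br /> I put it under <code>~/mylib</code> and then created a symbolic link to it. The command below uses the directory name <code>cpp-httplib-master</code>, which is what you get from unpacking a zip download; <code>git clone</code> creates <code>cpp-httplib</code> instead, so adjust the path to match.</p>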
<pre><code class="prism language-shell"><span class="token function">ln</span> <span class="token parameter variable">-s</span> ~/mylib/cpp-httplib-master cpp-httplib</code></pre>
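<h3>7.2 Writing <code>http_server.cc</code></h3>
<p>Writing a server by hand means creating a <code>socket</code>, binding it to an address, and then calling listen and accept. Here we do none of that and simply call the library.<br /> First create an HTTP server instance, then register a handler with <code>Get</code> to wait for requests.</p>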
<pre><code class="prism language-cpp">#include "cpp-httplib/httplib.h"
#include "searcher.hpp"

// Path of the parsed input file and the web root directory
const std::string input = "data/raw_html/raw.txt";
const std::string root_path = "./wwwroot";

int main()
{
    // Initialize the searcher
    ns_searcher::Searcher searcher;
    searcher.InitSearcher(input);
    // Create the HTTP server instance
    httplib::Server svr;
    svr.set_base_dir(root_path.c_str()); // serve the home page from the web root
    // Register the GET handler
    svr.Get("/s", [&amp;searcher](const httplib::Request&amp; req, httplib::Response&amp; res) {
        // Check that the request carries a 'query' parameter
        if (!req.has_param("query"))
        {
            res.set_content("param query is required", "text/plain");
            return;
        }
        // Read the query parameter
        std::string query = req.get_param_value("query");
        // Log it
        LOG(INFO, "query:%s", query.c_str());
        std::string json_string;
        // Run the search and collect the result
        searcher.Search(query, &amp;json_string);
        // Return the result as JSON (application/json is the registered MIME type)
        res.set_content(json_string, "application/json");
    });
    // Log the server start
    LOG(INFO, "start server");
    // Start listening
    svr.listen("0.0.0.0", 8080);
    return 0;
}</code></pre>
<p>We have not written any front-end code yet.<br /> Let's look at the result first:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/686528a4180dbf2862bf92ef388cd02b.png" alt="Pasted image 20250220163258" /><br /> Here the request carries no query.<br /> When we send a request with <code>query=split</code>:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/a20453167daa7fc8cd1675611840182e.png" alt="Pasted image 20250220163446" /><br /> the search results are returned.</p>
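<h2>8. Front End</h2>
<p>So far I have only learned the basic syntax of <code>html/css</code> and am not yet fluent in front-end work, so the explanations below are kept brief.<br /> Let me first sketch what the search page should look like:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/1db7f6d6fdc79445ab8e556e97fd64bc.png" alt="Pasted image 20250220164258" /><br /> Not bad; it is starting to take shape.<br /> Let's begin with the <code>html</code> file.</p>
<h3>8.1 The <code>html</code> part</h3>
<p>Which HTML element fits a search box best?<br /> The <code>input</code> tag, of course. The control that confirms the search is just a button, that is, the <code>button</code> tag.<br /> We put both inside the <code>search</code> class so they are easy to style later.<br /> We can also hard-code the layout of the results for now; later they will be generated by JS.<br /> Note that the <code>onclick="Search()"</code> attribute on the <code>&lt;button&gt;</code> element directly triggers the <code>Search</code> function in the JS.</p>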
<pre><code class="prism language-html">&lt;!DOCTYPE html&gt;
&lt;html lang="en"&gt;
&lt;head&gt;
    &lt;meta charset="UTF-8"&gt;
    &lt;meta name="viewport" content="width=device-width, initial-scale=1.0"&gt;
    &lt;title&gt;Document&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;div class="container"&gt;
        &lt;div class="search"&gt;
            &lt;input type="text" value="Enter search keywords"&gt;
            &lt;button onclick="Search()"&gt;Search&lt;/button&gt;
        &lt;/div&gt;
        &lt;div class="result"&gt;
            &lt;!-- Dynamically generated page content --&gt;
            &lt;div class="item"&gt;
                &lt;a href="#"&gt;This is the title&lt;/a&gt;
                &lt;p&gt;This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.&lt;/p&gt;
                &lt;i&gt;https://blog.csdn.net/2303_79015671?spm=1001.2101.3001.5343&lt;/i&gt;
            &lt;/div&gt;
            &lt;div class="item"&gt;
                &lt;a href="#"&gt;This is the title&lt;/a&gt;
                &lt;p&gt;This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.&lt;/p&gt;
                &lt;i&gt;https://blog.csdn.net/2303_79015671?spm=1001.2101.3001.5343&lt;/i&gt;
            &lt;/div&gt;
            &lt;div class="item"&gt;
                &lt;a href="#"&gt;This is the title&lt;/a&gt;
                &lt;p&gt;This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.&lt;/p&gt;
                &lt;i&gt;https://blog.csdn.net/2303_79015671?spm=1001.2101.3001.5343&lt;/i&gt;
            &lt;/div&gt;
            &lt;div class="item"&gt;
                &lt;a href="#"&gt;This is the title&lt;/a&gt;
                &lt;p&gt;This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.&lt;/p&gt;
                &lt;i&gt;https://blog.csdn.net/2303_79015671?spm=1001.2101.3001.5343&lt;/i&gt;
            &lt;/div&gt;
            &lt;div class="item"&gt;
                &lt;a href="#"&gt;This is the title&lt;/a&gt;
                &lt;p&gt;This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.&lt;/p&gt;
                &lt;i&gt;https://blog.csdn.net/2303_79015671?spm=1001.2101.3001.5343&lt;/i&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</code></pre>
<p><img src="https://i-blog.csdnimg.cn/img_convert/9c70a8a4091c7102300d268c39178f50.png" alt="Pasted image 20250220171008" /><br /> This is roughly what the page looks like after a search. It is not good enough yet: we still need CSS to style it, and the result list should only appear after a search has been made, not before, so later we will also handle that with JS.</p>
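<h3>8.2 The <code>CSS</code> part</h3>
<p>Let's first look at how Baidu's search box is put together.<br /> <img src="https://i-blog.csdnimg.cn/img_convert/795c8d04a028b6ff1b1f3f3868a51165.png" alt="Pasted image 20250220171105" /><br /> The features we can pick out are: a rounded rectangle, an input box joined directly to the submit button, a submit button with white text on a blue background, and so on.</p>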
<pre><code class="prism language-css">/* Remove all default margins and padding (HTML box model) */
* {
    /* outer margin */
    margin: 0;
    /* inner padding */
    padding: 0;
}
/* Let html and body fill 100% of the available height */
html, body {
    height: 100%;
}
/* Class selector: .container */
.container {
    /* width of the div */
    width: 800px;
    /* auto side margins center the block horizontally */
    margin: 0px auto;
    /* top margin keeps the block away from the top of the page */
    margin-top: 15px;
}
/* Descendant selector: .search inside .container */
.container .search {
    /* same width as the parent */
    width: 100%;
    /* height of 52px */
    height: 52px;
}
/* input is a type selector; select the tag first, then set its properties */
/* The height set on input excludes its border: 50px + 2 x 1px = 52px */
.container .search input {
    /* float to the left */
    float: left;
    width: 600px;
    height: 50px;
    /* border: width, style, color */
    border: 1px solid black;
    /* remove the input's right border */
    border-right: none;
    /* left padding so the text does not touch the left border */
    padding-left: 10px;
    /* color and size of the text inside the input */
    color: #CCC;
    font-size: 14px;
    border-top-left-radius: 10px;
    border-bottom-left-radius: 10px;
}
/* button is a type selector; select the tag first, then set its properties */
.container .search button {
    /* float to the left */
    float: left;
    width: 150px;
    height: 52px;
    /* button background color, #4e6ef2 */
    background-color: #4e6ef2;
    /* button text color */
    color: #FFF;
    /* font size */
    font-size: 19px;
    font-family: Georgia, 'Times New Roman', Times, serif;
    border-top-right-radius: 10px;
    border-bottom-right-radius: 10px;
}
.container .result {
    width: 100%;
}
.container .result .item {
    margin-top: 15px;
}
.container .result .item a {
    /* block-level element, takes a line of its own */
    display: block;
    /* remove the underline of the a tag */
    text-decoration: none;
    /* font size of the link text */
    font-size: 20px;
    /* font color */
    color: #4e6ef2;
}
.container .result .item a:hover {
    text-decoration: underline;
}
.container .result .item p {
    margin-top: 5px;
    font-size: 16px;
    font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
}
.container .result .item i {
    /* block-level element, takes a line of its own */
    display: block;
    /* cancel the italic style */
    font-style: normal;
    color: green;
}
</code></pre>
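<p>Here is how it looks after styling:<br /> <img src="https://i-blog.csdnimg.cn/img_convert/f007ad6eb0198edbc96a845ea5f03e05.png" alt="Pasted image 20250220172213" /><br /> It is starting to look the part, isn't it?</p>
<h3>8.3 The JS part</h3>
<p>Styling alone is not enough: the page has to be able to talk to our server, and for that we need to write some JS code.<br /> We need to read what the user typed and then send an HTTP request to interact with the server.<br /> Note: remember to include the <code>jQuery</code> library before this script.</p>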
<pre><code class="prism language-html">&lt;script src="http://code.jquery.com/jquery-2.1.1.min.js"&gt;&lt;/script&gt;</code></pre>
<pre><code class="prism language-javascript">&lt;script&gt;
    function Search() {
        // 1. Read the user's input; $ is an alias for jQuery
        let query = $(".container .search input").val();
        // console is the browser's developer console, handy for inspecting JS data
        console.log("query = " + query);

        // 2. Send the HTTP request; $.ajax is jQuery's function for exchanging data with the backend
        $.ajax({
            type: "GET",
            // encode the query so characters such as &amp; or # cannot break the URL
            url: "/s?query=" + encodeURIComponent(query),
            success: function(data) {
                console.log(data);
                BuildHtml(data);
            }
        });
    }

    function BuildHtml(data) {
        // Get the result element in the page
        let result_lable = $(".container .result");
        // Clear the results of the previous search
        result_lable.empty();

        for (let elem of data) {
            // console.log(elem.title);
            // console.log(elem.url);
            let a_lable = $("&lt;a&gt;", {
                text: elem.title,
                href: elem.url,
                // open the link in a new page
                target: "_blank"
            });
            let p_lable = $("&lt;p&gt;", {
                text: elem.desc
            });
            let i_lable = $("&lt;i&gt;", {
                text: elem.url
            });
            let div_lable = $("&lt;div&gt;", {
                class: "item"
            });
            a_lable.appendTo(div_lable);
            p_lable.appendTo(div_lable);
            i_lable.appendTo(div_lable);
            div_lable.appendTo(result_lable);
        }
    }
&lt;/script&gt;</code></pre>
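<p>Two details of this step can be checked without a browser: how the query ends up in the request URL, and what markup each result entry turns into. The sketch below is a minimal illustration in plain JavaScript, not part of the original page. The helper names <code>buildSearchUrl</code>, <code>escapeHtml</code> and <code>renderItems</code> are hypothetical, and it assumes the server returns an array of objects with <code>title</code>, <code>desc</code> and <code>url</code> fields, as the loop above suggests.</p>

```javascript
// Hypothetical helpers for illustration; they are not part of the original page.

// Build the request URL for the /s endpoint, percent-encoding the query.
function buildSearchUrl(query) {
    return "/s?query=" + encodeURIComponent(query.trim());
}

// Escape the characters that are special in HTML text and attribute values.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// Turn the result array into the .item markup that the CSS above styles.
// Assumed shape of each entry: { title, desc, url }.
function renderItems(results) {
    return results.map(function (elem) {
        const url = escapeHtml(elem.url);
        return '<div class="item">' +
            '<a href="' + url + '" target="_blank">' + escapeHtml(elem.title) + '</a>' +
            '<p>' + escapeHtml(elem.desc) + '</p>' +
            '<i>' + url + '</i>' +
            '</div>';
    }).join("");
}

console.log(buildSearchUrl("split string & vector"));
// prints "/s?query=split%20string%20%26%20vector"
```

<p>Building a string like this is only one option. The jQuery version above passes the values through <code>text:</code>, which treats them as plain text as well, so either approach keeps markup inside a title or description from being interpreted by the browser.</p>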
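<h2>9. Project summary</h2>
<p>Although this project is written as a Boost search engine, the same ideas apply to any documentation search engine. If you later want to build a search engine for some other documentation, very little of the current code has to change: roughly, only the URL-construction part of <code>parser.cc</code>. The rest of the code has little to do with which documents are being indexed. The program is just a tool, and searching is searching, whatever the content.<br /> Writing the whole thing from start to finish is well worth the effort.</p>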
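<p>To make the point about URL construction concrete, here is a rough sketch of the idea: strip the local root directory from a file path and append what is left to the documentation site's URL prefix. In the project this logic is C++ inside <code>parser.cc</code>; JavaScript is used here only to match the other example in this part. The function name <code>buildDocUrl</code>, the directory <code>data/input</code> and the prefix <code>https://example.com/docs</code> are all assumptions made for illustration.</p>

```javascript
// Hypothetical sketch; the names, paths and URL prefix are illustrative assumptions.
function buildDocUrl(filePath, srcRoot, urlHead) {
    // Normalize the root so it ends with exactly one slash.
    const root = srcRoot.replace(/\/+$/, "") + "/";
    if (!filePath.startsWith(root)) {
        throw new Error("file is outside the document root: " + filePath);
    }
    // Keep only the part of the path below the local root.
    const tail = filePath.slice(root.length);
    // Join the site prefix and the tail with a single slash.
    return urlHead.replace(/\/+$/, "") + "/" + tail;
}

console.log(buildDocUrl("data/input/array.html", "data/input", "https://example.com/docs"));
// prints "https://example.com/docs/array.html"
```

<p>Switching to another documentation set would then mostly mean changing the two prefixes, provided the local directory layout mirrors the layout of the online documentation.</p>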
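<h3>9.1 A small rant</h3>
<p>Probably because my cloud server is so low-spec, every run stalled for more than ten minutes at the index-building step. So I got smart: when I only need to debug, I empty <code>raw.txt</code> and the program starts instantly, haha.<br /> As for CentOS, I can't even be bothered to complain. Why restrict the GCC version? I followed a tutorial to upgrade it, but after installing <code>SCL</code> I could no longer install any packages at all. It turned out that CentOS 7 reached end of life on June 30, 2024, so the default official mirrors are no longer reachable. Switching to the Tsinghua mirror did not help either, so I had to move to Ubuntu.</p>
<h2>10. Project source code</h2>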
<p>gitee</p>
</div>
				
				
             </div>
		</div>
    

			
    
	</div>
</div>
</div>

 </body></html>