Write Your First Web Crawler in Python: Data Scraping Made Easy for Beginners (A Detailed Tutorial Covering the Latest Python Crawling Libraries)

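Abstract

This is a detailed introductory Python web-scraping tutorial for crawler enthusiasts. It covers the key techniques from basic to advanced: scraping static pages with Requests and BeautifulSoup; parsing efficiently with lxml, XPath, and CSS selectors; building distributed crawler projects with the Scrapy framework; handling JavaScript-rendered pages through browser automation with Selenium and Playwright; raising concurrency with asynchronous crawlers built on aiohttp and HTTPX; and dealing with anti-scraping measures (proxy IP pools, User-Agent spoofing, CAPTCHA recognition) in scenarios such as e-commerce, news, and social-media data collection. The goal is to get you started quickly on larger crawling projects and to help you build a scalable, efficient, and stable data-collection solution.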



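Contents

  1. Preface

  2. Crawler Fundamentals

    • 2.1 What Is a Crawler?
    • 2.2 Use Cases
    • 2.3 The Basic Crawling Workflow
    • 2.4 Legal and Ethical Considerations
  3. Setting Up the Development Environment

    • 3.1 Installing Python (3.8 or Later Recommended)
    • 3.2 Creating and Activating a Virtual Environment
    • 3.3 Recommended Development Tools
  4. Basics: A Simple Crawler with Requests + BeautifulSoup

    • 4.1 Installing the Required Libraries
    • 4.2 Understanding HTTP Requests and Responses
    • 4.3 Writing Your First Crawler: Fetching a Page Title
    • 4.4 Parsing HTML: BeautifulSoup in Detail
    • 4.5 File Storage: Saving Scraped Data as CSV/JSON
    • 4.6 Common Anti-Scraping Measures and Countermeasures
  5. Going Further: More Powerful Parsing Tools

    • 5.1 lxml (XPath)
    • 5.2 parsel (the Parser Built into Scrapy)
    • 5.3 PyQuery (jQuery-Style Parsing)
    • 5.4 Regular Expressions in Crawlers
  6. Frameworks: A Complete Introduction to Scrapy

    • 6.1 Introduction to Scrapy
    • 6.2 Installation and Project Structure
    • 6.3 Writing Your First Scrapy Spider
    • 6.4 Items, Pipelines, and Settings in Detail
    • 6.5 Interactive Debugging with Scrapy Shell
    • 6.6 Distribution and Multithreading: Configuring Scrapy Concurrency
    • 6.7 Scrapy Middleware and Extensions (Downloader Middleware, Downloader Handler)
  7. Dynamic Content: Selenium and Playwright

    • 7.1 Why Browser Automation?
    • 7.2 Selenium Basics
    • 7.3 Playwright for Python (Faster and Lighter)
    • 7.4 Headless Mode and Performance Tuning
    • 7.5 Combining Selenium/Playwright with BeautifulSoup
  8. Asynchronous Crawlers: aiohttp + asyncio and HTTPX

    • 8.1 Sync vs. Async: A Brief Look at Performance
    • 8.2 Getting Started with aiohttp
    • 8.3 Raising Concurrency with asyncio Coroutine Pools
    • 8.4 HTTPX: An Async-Capable Successor to Requests
    • 8.5 Parsing in Async Code (aiohttp + lxml)
  9. Data Storage and Deduplication

    • 9.1 Local Files: CSV, JSON, SQLite
    • 9.2 Relational Databases such as MySQL/PostgreSQL
    • 9.3 NoSQL Stores such as MongoDB
    • 9.4 Redis for Deduplication and Short-Term Caching
    • 9.5 Deduplication Strategies: Fingerprints, Hashes, Bloom Filters
  10. Distributed Crawlers: Scrapy-Redis and Distributed Scheduling

    • 10.1 Why Go Distributed?
    • 10.2 Scrapy-Redis: Introduction and Installation
    • 10.3 Distributed Deduplication Queues and Scheduling
    • 10.4 A Multi-Machine Example
  11. Common Anti-Scraping Measures and Countermeasures

    • 11.1 Rate Limits and Request-Header Spoofing
    • 11.2 Login Verification and Cookie Management
    • 11.3 CAPTCHA Recognition (Brief Introduction)
    • 11.4 Building and Rotating a Proxy IP Pool
  12. Complete Case Study: Crawling a News Site into a Database

    • 12.1 Requirements Analysis
    • 12.2 A Complete Implementation with Scrapy + MySQL
    • 12.3 Code Walkthrough and Common Q&A
  13. Commonly Used Third-Party Libraries for Python Crawling (as of the End of 2024)

    • 13.1 Basic Requests and Parsing
    • 13.2 Browser Automation
    • 13.3 Asynchronous Crawling
    • 13.4 Login Simulation and CAPTCHA Handling
    • 13.5 Anti-Scraping and Proxies
    • 13.6 Distributed Scheduling
    • 13.7 Other Useful Tools
  14. Appendix

    • 14.1 Common Errors and Fixes
    • 14.2 HTTP Status Code Quick Reference
    • 14.3 Learning Resources and Next Steps
  15. Summary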


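1. Preface

In an age of information overload, the internet has become the richest and most convenient source of data. From product prices on e-commerce platforms to the latest stories on news sites, from trending topics on social media to job postings on recruitment sites, almost anything you can think of can be pulled out of web pages with a crawler. For beginners, crawlers are not mysterious: once you understand HTTP, HTML, and basic Python programming, you can get started quickly. This tutorial is aimed at complete beginners. It moves step by step from basic fetching to frameworks, asynchronous and distributed crawling, and anti-scraping strategies, guides you through building a complete crawler, and summarizes the most commonly used Python crawling libraries as of the end of 2024.

What this tutorial offers

  • Step by step: start with the simplest requests + BeautifulSoup combination, then move on to Scrapy, Selenium, Playwright, and asynchronous crawlers.
  • Detailed examples: every tool and framework comes with complete, runnable example code that you can copy, run, and observe.
  • Up-to-date library overview: a survey of the mainstream libraries in the crawling ecosystem as of the end of 2024, to help you pick the right tool.
  • Anti-scraping in practice: from simple User-Agent spoofing to proxy IP pools, CAPTCHA recognition, and distributed deployment, covering the anti-scraping mechanisms of target sites from several angles.

Notes

  1. All examples in this tutorial target Python 3.8+. Python 3.10 or later is strongly recommended for better compatibility and performance.
  2. When crawling a site, always respect its robots.txt and the applicable laws and regulations, and avoid putting unnecessary load on other people's servers.
  3. The "latest libraries" listed here reflect the state of things at the end of 2024. For libraries and features released in 2025 and later, consult the official documentation and community resources.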

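2. Crawler Fundamentals

2.1 What Is a Crawler?

  • Definition: a crawler (web crawler, also called a spider or bot) is a data-collection tool that visits web pages programmatically, extracts the useful information, and stores it.
  • How it works: the crawler sends an HTTP request to a given URL and receives the response (HTML, JSON, images, and so on). It then extracts the required data from that response with a parsing technique (XPath, CSS selectors, regular expressions) and finally saves the data to a file or a database.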

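2.2 Use Cases

  1. Data analysis: e-commerce price monitoring, product review analysis, competitor research.
  2. Public-opinion monitoring: social-media trends, forum posts, news statistics.
  3. Search engines: Google, Bing, Baidu, and others crawl pages periodically to build their indexes.
  4. Job-market data: automatically collecting positions, salaries, and company information from recruitment sites.
  5. Academic research: crawling paper metadata, building knowledge graphs, and so on.
  6. Content aggregation: aggregator sites that gather articles from scattered sources onto one platform.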

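2.3 The Basic Crawling Workflow

  1. Determine the target URL: decide which pages to crawl. They may be static pages or pages that load content dynamically.
  2. Send an HTTP request: typically use a library such as requests, httpx, or aiohttp to send a GET or POST request to the target URL and receive the response.
  3. Parse the response: the response may be HTML, JSON, XML, an image, and so on. Common parsing tools include BeautifulSoup, lxml, parsel, PyQuery, and regular expressions.
  4. Extract the data: locate the target content by tag name, attribute, XPath, or CSS selector, and pull out its text or attributes.
  5. Process and store the data: clean and deduplicate the extracted content, then save it to CSV, JSON, SQLite, MySQL, MongoDB, or another store.
  6. Paginate or recurse: if you need data from several pages, work out the pagination logic (URL template, Ajax requests) and repeat the request and parse steps in a loop.
  7. Handle errors and anti-scraping measures: configure proxies, random User-Agents, rate limiting, and IP rotation; handle HTTP 403 responses, CAPTCHAs, redirects, and the like.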
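
The steps above can be sketched end to end without touching the network. This is only an illustration: the `FAKE_SITE` dictionary, the URL template, and the page contents are all hypothetical stand-ins for real HTTP responses, and the standard-library `html.parser` is used in place of BeautifulSoup or lxml so that the sketch has no third-party dependencies.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a website: URL -> HTML body (no real HTTP is done).
FAKE_SITE = {
    "https://example.com/list?page=1": '<a href="/p1">Post A</a><a href="/p2">Post B</a>',
    "https://example.com/list?page=2": '<a href="/p2">Post B</a><a href="/p3">Post C</a>',
}


class LinkParser(HTMLParser):
    """Collect (text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def fetch(url):
    """Step 2: 'send' a request (here: look the page up in FAKE_SITE)."""
    return FAKE_SITE[url]


def crawl(pages):
    seen, rows = set(), []
    for page in pages:                                   # step 6: pagination
        url = f"https://example.com/list?page={page}"    # step 1: target URL
        parser = LinkParser()
        parser.feed(fetch(url))                          # steps 3-4: parse and extract
        parser.close()
        for title, href in parser.links:
            if href not in seen:                         # step 5: deduplicate
                seen.add(href)
                rows.append({"title": title, "url": href})
    return rows


def to_csv(rows):
    """Step 5: serialise the cleaned rows (to a string here, a file in practice)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = crawl([1, 2])
print(to_csv(rows))
```

In a real crawler, `fetch` would wrap an HTTP client call and step 7 (error handling, throttling) would sit around it; the overall shape of the loop stays the same.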

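2.4 Legal and Ethical Considerations

  • Before sending requests, check the target site's robots.txt (usually at the site root, for example https://example.com/robots.txt) and follow its crawling rules.
  • Some sites forbid bulk crawling or commercial use. Read and respect their copyright and privacy policies before you crawl.
  • Do not put excessive load on the target site. Set a reasonable delay (time.sleep) and limit concurrency.
  • Comply with the laws and regulations that govern both crawling and the subsequent use of the collected data, and never use it for illegal purposes.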
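
As a rough illustration of the robots.txt check, the standard library ships `urllib.robotparser`. The robots.txt content and the crawler name `my-spider` below are hypothetical; a real crawler would download the file from the target site (for example with `set_url()` and `read()`) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

USER_AGENT = "my-spider"  # hypothetical crawler name

allowed = parser.can_fetch(USER_AGENT, "https://example.com/articles/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds, or None if not specified

print(allowed, blocked, delay)
```

The returned delay can be passed straight to `time.sleep` between requests.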

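3. Setting Up the Development Environment

3.1 Installing Python (3.8 or Later Recommended)

  1. Windows

    • Download a 3.8+ installer from https://www.python.org/downloads, tick the "Add Python 3.x to PATH" checkbox (it is usually not ticked by default), and click "Install Now".

    • After installation, open a command prompt (Win + R → type cmd → Enter) and run:

      python --version
      pip --version

      to confirm that Python and pip are installed.

  2. macOS

    • Installing with Homebrew is recommended:

      brew install python@3.10
    • After installation, run:

      python3 --version
      pip3 --version

      and check that both commands report a version.

  3. Linux (Ubuntu/Debian)

    sudo apt update
    sudo apt install python3 python3-pip python3-venv -y

    Then run:

    python3 --version
    pip3 --version

    to confirm the installation.

Tip: if both Python 2.x and Python 3.x are installed on your machine, you may need to use python3 and pip3 instead of python and pip.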

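3.2 Creating and Activating a Virtual Environment

To avoid conflicts between globally installed dependencies, create a separate virtual environment for each crawler project:

# Create the project root directory and enter it
mkdir my_spider && cd my_spider
# Create a virtual environment in the project directory
# (use "python -m venv venv" if the python3 command is not available)
python3 -m venv venv
# Activate the virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

Once activated, the shell prompt is prefixed with (venv), and every package you install from then on affects only this environment.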
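
If you are unsure whether the environment is really active, a quick check from Python itself can help. This is a small sketch that relies on how `venv` sets `sys.prefix`; the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("executable:", sys.executable)
print("inside venv:", in_virtualenv())
```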

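3.3 Recommended Development Tools

  • IDEs and editors

    • PyCharm Community / Professional: powerful, with integrated testing and version control.
    • VS Code: lightweight with a rich plugin ecosystem, well suited to quick editing.
    • Sublime Text: lightweight and fast to start; handy for small scripts.
  • Debugging

    • The debuggers built into VS Code and PyCharm support stepping and breakpoints.
    • For command-line scripts you can also use pdb.
  • Version control

    • Git, together with the Git integration in VS Code or PyCharm, for hosting code and collaborating.
    • Host the project on GitHub, Gitee, or a similar service.
  • Other helpers

    • Postman / Insomnia: for simulating HTTP requests and inspecting response headers.
    • Charles / Fiddler: packet-capture tools for debugging AJAX requests, cookies, headers, and so on.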

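4. Basics: A Simple Crawler with Requests + BeautifulSoup

4.1 Installing the Required Libraries

Inside the virtual environment, run:

pip install requests beautifulsoup4 lxml

  • requests: the most widely used HTTP library for Python, used to send GET/POST requests.
  • beautifulsoup4: a popular HTML/XML parsing library that is easy to learn.
  • lxml: a fast, powerful parser that BeautifulSoup can use as its backend.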

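4.2 Understanding HTTP Requests and Responses

  • HTTP request: consists of a method (GET, POST, PUT, and so on), a URL, request headers, and a request body.

  • HTTP response: contains a status code (200, 404, 500, and so on), response headers, and a response body (usually HTML, JSON, an image, or a file).

  • Common Requests parameters

    • url: the request address.
    • params: URL query parameters (dict or string).
    • headers: custom request headers (for example User-Agent, Referer, Cookie).
    • data / json: form data or JSON sent with a POST request.
    • timeout: timeout in seconds, so that a request cannot hang forever.
    • proxies: proxy configuration (covered later).

Example:

import requests

url = 'https://httpbin.org/get'
params = {'q': 'python 爬虫', 'page': 1}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
}
response = requests.get(url, params=params, headers=headers, timeout=10)
print(response.status_code)   # status code, e.g. 200
print(response.encoding)      # encoding, e.g. 'utf-8'
print(response.text[:200])    # first 200 characters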
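
To make the role of `params` more concrete, the sketch below builds the final URL by hand with the standard library. It is an approximation for illustration only: Requests performs its own encoding when you pass `params`, and the exact string it produces may differ in detail from what `urllib.parse.urlencode` gives here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://httpbin.org/get"
params = {"q": "python 爬虫", "page": 1}

# urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Round trip: decode the query string back into a dict of lists.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Seeing the encoded query string is useful when you compare your request against the one shown in the browser's Network panel.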

4.3 Writing Your First Crawler: Fetching a Page Title

The simplest workflow is shown below, using the title of https://www.example.com as the target:

<pre><code class="prism language-python"># file: simple_spider.py
import requests
from bs4 import BeautifulSoup


def fetch_title(url):
    try:
        # 1. Send a GET request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        # 2. Set the correct encoding
        response.encoding = response.apparent_encoding
        # 3. Parse the HTML
        soup = BeautifulSoup(response.text, 'lxml')
        # 4. Extract the content of the &lt;title&gt; tag
        title_tag = soup.find('title')
        if title_tag:
            return title_tag.get_text().strip()
        else:
            return 'No title tag found'
    except Exception as e:
        return f'Fetch failed: {e}'


if __name__ == '__main__':
    url = 'https://www.example.com'
    title = fetch_title(url)
    print(f'Page title: {title}')</code></pre>
<p><strong>Sample output</strong>:</p>
<pre><code class="prism language-bash">(venv) $ python simple_spider.py
Page title: Example Domain</code></pre>
<h4>4.4 Parsing HTML: BeautifulSoup in Detail</h4>
<p><code>BeautifulSoup</code> is easy to use. The most common operations are:</p>
<ol>
<li>
<p><strong>Creating the object</strong></p>
<pre><code class="prism language-python">soup = BeautifulSoup(html_text, 'lxml')  # or 'html.parser'</code></pre>
</li>
<li>
<p><strong>Finding a single node</strong></p>
<ul>
<li><code>soup.find(name, attrs={}, recursive=True, string=None, **kwargs)</code></li>
<li>Example: <code>soup.find('div', class_='content')</code></li>
<li>Use <code>attrs={'class': 'foo', 'id': 'bar'}</code> to match on several attributes at once.</li>
</ul>
</li>
<li>
<p><strong>Finding all nodes</strong></p>
<ul>
<li><code>soup.find_all(name, attrs={}, recursive=True, string=None, limit=None, **kwargs)</code></li>
<li>Example: <code>soup.find_all('a', href=True)</code> returns every link that has an <code>href</code> attribute.</li>
</ul>
</li>
<li>
<p><strong>CSS selectors</strong></p>
<ul>
<li><code>soup.select('div.content > ul li a')</code> returns a list.</li>
<li>Supports id (<code>#id</code>), class (<code>.class</code>), attribute (<code>[attr=value]</code>) selectors, and more.</li>
</ul>
</li>
<li>
<p><strong>Getting attributes or text</strong></p>
<ul>
<li><code>node.get('href')</code>: returns the attribute value, or <code>None</code> if the attribute is missing;</li>
<li><code>node['href']</code>: same, but raises <code>KeyError</code> if the attribute is missing;</li>
<li><code>node.get_text(strip=True)</code>: returns the node's text with leading and trailing whitespace removed;</li>
<li><code>node.text</code>: returns the combined text of the node and its descendants.</li>
</ul>
</li>
<li>
<p><strong>Handy attributes</strong></p>
<ul>
<li><code>soup.title</code> / <code>soup.title.string</code> / <code>soup.title.text</code></li>
<li>Shortcuts such as <code>soup.body</code> / <code>soup.head</code> / <code>soup.a</code> / <code>soup.div</code>, each returning the first matching tag.</li>
</ul>
</li>
<li>
<p><strong>Example: extracting every article link from a list page</strong></p>
<pre><code class="prism language-python">html = response.text
soup = BeautifulSoup(html, 'lxml')
# Assume every article link looks like
# &lt;h2 class="post-title"&gt;&lt;a href="..."&gt;...&lt;/a&gt;&lt;/h2&gt;
for h2 in soup.find_all('h2', class_='post-title'):
    a_tag = h2.find('a')
    if a_tag is None:  # skip headings that contain no link
        continue
    title = a_tag.get_text(strip=True)
    link = a_tag.get('href')
    print(title, link)</code></pre>
</li>
</ol>
<h4>4.5 File Storage: Saving Scraped Data as CSV/JSON</h4>
<ol>
<li>
<p><strong>CSV</strong></p>
<pre><code class="prism language-python">import csv

data = [
    {'title': 'First post', 'url': 'https://...'},
    {'title': 'Second post', 'url': 'https://...'},
    # ...
]

with open('result.csv', mode='w', newline='', encoding='utf-8-sig') as f:
    fieldnames = ['title', 'url']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for item in data:
        writer.writerow(item)</code></pre>
<ul>
<li><code>encoding='utf-8-sig'</code> writes a BOM so that Excel opens the file without garbled characters.</li>
</ul>
</li>
<li>
<p><strong>JSON</strong></p>
<pre><code class="prism language-python">import json

data = [
    {'title': 'First post', 'url': 'https://...'},
    {'title': 'Second post', 'url': 'https://...'},
    # ...
]

with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)</code></pre>
</li>
<li>
<p><strong>SQLite</strong> (suitable for small projects)</p>
<pre><code class="prism language-python">import sqlite3

conn = sqlite3.connect('spider.db')
cursor = conn.cursor()

# Create the table if it does not exist
cursor.execute('''
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        url TEXT UNIQUE
    );
''')

# Insert data
items = [
    ('First post', 'https://...'),
    ('Second post', 'https://...'),
]
for title, url in items:
    try:
        cursor.execute('INSERT INTO articles (title, url) VALUES (?, ?)', (title, url))
    except sqlite3.IntegrityError:
        pass  # the URL already exists, skip it

conn.commit()
conn.close()</code></pre>
</li>
</ol>
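<p>As a variation on the SQLite approach, duplicates can also be skipped by the database itself rather than by catching <code>IntegrityError</code>. The sketch below is one possible alternative, not the only way to do it; it uses an in-memory database and made-up example URLs so that it leaves no file behind.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, url TEXT UNIQUE)"
)

items = [
    ("First post", "https://example.com/p1"),
    ("Second post", "https://example.com/p2"),
    ("First post again", "https://example.com/p1"),  # duplicate URL
]

# INSERT OR IGNORE silently skips rows that would violate the UNIQUE constraint.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT OR IGNORE INTO articles (title, url) VALUES (?, ?)", items
    )

stored = conn.execute("SELECT title, url FROM articles ORDER BY id").fetchall()
print(stored)
conn.close()
```

<p>Using <code>executemany</code> inside a single transaction also tends to be faster than committing row by row when many items are inserted.</p>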
<h4>4.6 Common Anti-Scraping Measures and Countermeasures</h4>
<ol>
<li>
<p><strong>User-Agent detection</strong></p>
<ul>
<li>By default <code>requests</code> sends a User-Agent of the form <code>python-requests/x.y.z</code>, which is easily recognised as a crawler and blocked.</li>
<li>Countermeasure: pick a common browser User-Agent at random for the request headers.</li>
</ul>
<pre><code class="prism language-python">import random

import requests

USER_AGENTS = [
    'Mozilla/5.0 ... Chrome/100.0.4896.127 ...',
    'Mozilla/5.0 ... Firefox/110.0 ...',
    'Mozilla/5.0 ... Safari/605.1.15 ...',
    # more can be collected online
]
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get(url, headers=headers)</code></pre>
</li>
<li>
<p><strong>IP limits</strong></p>
<ul>
<li>If one IP sends a large number of requests in a short time, the server may ban it or return 403.</li>
<li>Countermeasure: use a proxy pool (see Section 11) and rotate IPs regularly.</li>
</ul>
</li>
<li>
<p><strong>Cookie checks</strong></p>
<ul>
<li>Some sites show their full content only after login. Simulate the login first to obtain the cookies, then send them with later requests.</li>
<li>Use <code>requests.Session()</code> to manage the session; a single Session stores and sends cookies automatically.</li>
</ul>
<pre><code class="prism language-python">import requests

session = requests.Session()
login_data = {'username': 'xxx', 'password': 'xxx'}
session.post('https://example.com/login', data=login_data)
# After a successful login, the session has stored the cookies automatically
response = session.get('https://example.com/protected-page')</code></pre>
</li>
<li>
<p><strong>CAPTCHAs</strong></p>
<ul>
<li>Simple CAPTCHAs can sometimes be recognised automatically with OCR, but complex image CAPTCHAs require a dedicated solving service or manual recognition.</li>
<li>While you are learning, prefer sites that do not require a CAPTCHA, or check first whether the site offers an API.</li>
</ul>
</li>
<li>
<p><strong>AJAX / dynamic rendering</strong></p>
<ul>
<li>If the page data is loaded dynamically by JavaScript, <code>requests</code> alone only retrieves the static HTML.</li>
<li>Countermeasure: analyse the AJAX endpoints in the browser's Network panel and request the JSON they return directly, or use a browser automation tool (Selenium/Playwright) to render the page as a browser would.</li>
</ul>
</li>
</ol>
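<p>Alongside the measures above, throttling and retrying politely often reduces the chance of being blocked in the first place. The helper below is a hypothetical sketch, not part of any library: <code>fetch_with_retry</code>, its parameters, and the simulated <code>flaky_fetch</code> are all invented for illustration, and the fetch function is injected so the example runs without network access.</p>

```python
import random
import time


def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            # Exponential backoff with a little random jitter so that
            # requests do not arrive at perfectly regular intervals.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


# Demo with a hypothetical flaky fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 403 / timeout")
    return f"<html>content of {url}</html>"


waits = []  # record the delays instead of actually sleeping
html = fetch_with_retry(flaky_fetch, "https://example.com", sleep=waits.append)
print(html, len(waits))
```

<p>In real use, <code>fetch</code> could be a small wrapper around <code>session.get</code> that calls <code>raise_for_status()</code>, and the default <code>sleep=time.sleep</code> would apply the delays for real.</p>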
<hr />
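<h3>5. Going Further: More Powerful Parsing Tools</h3>
<p>BeautifulSoup is enough for most beginner scenarios, but when the markup is complex or deeply nested, or when you need efficient bulk extraction, the tools below are a better fit.</p>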
<h4>5.1 lxml (XPath)</h4>
<ul>
<li>
<p><strong>Features</strong>: built on the C libraries libxml2 and libxslt, so parsing is fast; supports XPath 1.0 queries.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip install lxml</code></pre>
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">from lxml import etree

html = '''
    &lt;div class="post"&gt;&lt;h2&gt;&lt;a href="/p1"&gt;Article A&lt;/a&gt;&lt;/h2&gt;&lt;/div&gt;
    &lt;div class="post"&gt;&lt;h2&gt;&lt;a href="/p2"&gt;Article B&lt;/a&gt;&lt;/h2&gt;&lt;/div&gt;
'''

# 1. Turn the text into an Element tree
tree = etree.HTML(html)

# 2. Use XPath to extract every link text and href
titles = tree.xpath('//div[@class="post"]/h2/a/text()')
links = tree.xpath('//div[@class="post"]/h2/a/@href')
for t, l in zip(titles, links):
    print(t, l)
# Output:
# Article A /p1
# Article B /p2</code></pre>
</li>
<li>
<p><strong>Common XPath syntax</strong>:</p>
<ul>
<li><code>//tag[@attr="value"]</code>: every <code>tag</code> element whose attribute matches;</li>
<li><code>text()</code>: the text nodes of an element;</li>
<li><code>@href</code>: an attribute value;</li>
<li><code>//div//a</code>: every <code>a</code> among the descendants of a <code>div</code>;</li>
<li><code>//ul/li[1]</code>: the first <code>li</code> child of each <code>ul</code>;</li>
<li><code>contains(@class, "foo")</code>: elements whose <code>class</code> attribute contains the substring <code>foo</code>.</li>
</ul>
</li>
</ul>
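<p>If lxml is not installed yet, a few of these path expressions can be tried with the standard library's <code>xml.etree.ElementTree</code>. This is only a rough stand-in: ElementTree supports a limited XPath subset and needs well-formed markup, and the sample document below is made up for the example.</p>

```python
import xml.etree.ElementTree as ET

# Well-formed markup is required here: unlike lxml's HTML parser,
# ElementTree does not repair broken HTML.
doc = """
<root>
  <div class="post"><h2><a href="/p1">Post A</a></h2></div>
  <div class="post"><h2><a href="/p2">Post B</a></h2></div>
  <div class="ad"><h2><a href="/x">Sponsored</a></h2></div>
</root>
"""

root = ET.fromstring(doc)

# The subset supports './/tag' and '[@attr="value"]' predicates, but not
# text() or @href steps, so text and attributes are read from the elements.
anchors = root.findall('.//div[@class="post"]/h2/a')
pairs = [(a.text, a.get("href")) for a in anchors]
print(pairs)

# Positional predicate: the first <div> child of <root>.
first = root.find("./div[1]/h2/a")
print(first.text)
```

<p>Functions such as <code>contains()</code> are not available in this subset, which is one practical reason to move to lxml for real pages.</p>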
<h4>5.2 parsel (the Parser Built into Scrapy)</h4>
<ul>
<li>
<p><strong>Features</strong>: the CSS/XPath extraction library that Scrapy uses for its selectors. Its interface is similar to lxml's but follows Scrapy's data-extraction conventions more closely.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip install parsel</code></pre>
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">from parsel import Selector

html = '''
&lt;ul&gt;
    &lt;li class="item"&gt;&lt;a href="/a1"&gt;Item1&lt;/a&gt;&lt;/li&gt;
    &lt;li class="item"&gt;&lt;a href="/a2"&gt;Item2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
'''
sel = Selector(text=html)

# Using CSS selectors
for item in sel.css('li.item'):
    title = item.css('a::text').get()
    link = item.css('a::attr(href)').get()
    print(title, link)

# Using XPath
for item in sel.xpath('//li[@class="item"]'):
    title = item.xpath('./a/text()').get()
    link = item.xpath('./a/@href').get()
    print(title, link)</code></pre>
</li>
<li>
<p><code>parsel.Selector</code> objects are used all the time in Scrapy, and they work just as well outside a Scrapy project.</p>
</li>
</ul>
<h4>5.3 PyQuery (jQuery-Style Parsing)</h4>
<ul>
<li>
<p><strong>Features</strong>: the API mirrors jQuery, so anyone with front-end experience will pick it up quickly.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip install pyquery</code></pre>
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">from pyquery import PyQuery as pq

html = '''
&lt;div id="posts"&gt;
    &lt;h2&gt;&lt;a href="/x1"&gt;News X1&lt;/a&gt;&lt;/h2&gt;
    &lt;h2&gt;&lt;a href="/x2"&gt;News X2&lt;/a&gt;&lt;/h2&gt;
&lt;/div&gt;
'''
doc = pq(html)

# Locate nodes by tag, ID, or CSS selector
for item in doc('#posts h2'):
    # item is an lxml Element and has to be wrapped again
    a = pq(item).find('a')
    title = a.text()
    url = a.attr('href')
    print(title, url)</code></pre>
</li>
<li>
<p>PyQuery uses lxml internally as its parser, so its speed is comparable to calling lxml directly.</p>
</li>
</ul>
<h4>5.4 正则表达式在爬虫中的应用</h4>
<ul>
<li>
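<p>Regular expressions are not a general-purpose HTML parser, but they are very convenient for extracting simple patterns such as email addresses, phone numbers, or strings with a fixed format.</p>
</li>
<li>
<p>In a crawler, first use BeautifulSoup or lxml to locate the relevant block of content, then apply a regular expression to that block's text.</p>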
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">import re
from bs4 import BeautifulSoup

html = '''
&lt;div class="info"&gt;
  Contact email: abc@example.com
  Contact phone: 123-4567-890
&lt;/div&gt;
'''
soup = BeautifulSoup(html, 'lxml')
info = soup.find('div', class_='info').get_text()

# Match email addresses
email_pattern = r'[\w\.-]+@[\w\.-]+'
emails = re.findall(email_pattern, info)
print('Emails:', emails)

# Match phone numbers
phone_pattern = r'\d{3}-\d{4}-\d{3,4}'
phones = re.findall(phone_pattern, info)
print('Phones:', phones)</code></pre>
</li>
</ul>
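<p>The second step of that approach, applying patterns to text that has already been isolated, can be sketched with the standard library alone. The sample text, the names <code>EMAIL_RE</code> and <code>PHONE_RE</code>, and the stricter email pattern are assumptions made for illustration; real email and phone formats vary widely, so treat the patterns as a starting point.</p>

```python
import re

# Hypothetical text, as it might come out of get_text() on a contact block
info = """
Contact email: abc@example.com, sales@example.org.
Contact phone: 123-4567-890
"""

# Compile once and reuse; requiring a dot-separated domain keeps a
# trailing period or comma out of the matched address
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Named groups make the parts of each match self-documenting
PHONE_RE = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{4})-(?P<line>\d{3,4})")

emails = EMAIL_RE.findall(info)
phones = [m.groupdict() for m in PHONE_RE.finditer(info)]

print(emails)
print(phones)
```

<p>Compiling the patterns up front pays off when the same expressions are applied to many pages in a crawl.</p>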
<hr />
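<h3>6. Framework: Getting Started with Scrapy</h3>
<p>If you want to build a maintainable, extensible crawler project quickly, Scrapy is one of the most mature and popular crawling frameworks in the Python ecosystem.</p>
<h4>6.1 Introduction to Scrapy</h4>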
<ul>
<li>
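<p><strong>Scrapy</strong>: an open-source framework designed for large-scale web crawling and data extraction. It offers high performance, concurrent requests, and built-in middleware and pipeline mechanisms; distributed crawling is available through extensions such as Scrapy-Redis.</p>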
</li>
<li>
<p><strong>When to use it</strong>:</p>
<ul>
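<li>Crawling a large number of pages of the same type.</li>
<li>Applying complex cleaning, deduplication, and storage steps to the scraped data.</li>
<li>Projects that need heavily customized middleware or extensions.</li>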
</ul>
</li>
</ul>
<h4>6.2 Installation and Project Structure</h4>
<ol>
<li>
<p>Install Scrapy:</p>
<pre><code class="prism language-bash">pip <span class="token function">install</span> scrapy</code></pre>
</li>
<li>
<p>Create a Scrapy project:</p>
<pre><code class="prism language-bash">scrapy startproject myproject</code></pre>
</li>
<li>
<p>Project directory layout (example):</p>
<pre><code>myproject/
    scrapy.cfg            # configuration file used for deployment
    myproject/            # the project's Python module
        __init__.py
        items.py          # data models (Items)
        middlewares.py    # custom middlewares
        pipelines.py      # item processing and storage pipelines
        settings.py       # global Scrapy settings
        spiders/          # spider files go here
            __init__.py
            example_spider.py</code></pre>
</li>
</ol>
<h4>6.3 Writing Your First Scrapy Spider</h4>
<p>Suppose we want to crawl every quote and its author from <code>quotes.toscrape.com</code>:</p>
<ol>
<li>
<p>Create <code>quotes_spider.py</code> under <code>myproject/spiders/</code>:</p>
<pre><code class="prism language-python">import scrapy
from myproject.items import MyprojectItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # spider name, used when running the crawl
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Extract each quote block
        for quote in response.css('div.quote'):
            item = MyprojectItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item

        # Pagination: follow the next-page link and parse it with the same callback
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)</code></pre>
</li>
<li>
<p>Define the Item model (<code>myproject/items.py</code>):</p>
<pre><code class="prism language-python">import scrapy


class MyprojectItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()</code></pre>
</li>
<li>
<p>Configure a storage Pipeline (it could write to JSON, CSV, or a database), for example in <code>myproject/pipelines.py</code>:</p>
<pre><code class="prism language-python">import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('quotes.json', 'w', encoding='utf-8')
        self.file.write('[\n')
        self.first_item = True

    def close_spider(self, spider):
        self.file.write('\n]')
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        # Write the separator before every item except the first, so the
        # array has no trailing comma and the file stays valid JSON
        if not self.first_item:
            self.file.write(',\n')
        self.first_item = False
        self.file.write(line)
        return item</code></pre>
<p>Then enable it in <code>settings.py</code>:</p>
<pre><code class="prism language-python">ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}</code></pre>
</li>
<li>
<p>Run the spider:</p>
<pre><code class="prism language-bash">scrapy crawl quotes</code></pre>
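<p>After the run, <code>quotes.json</code> appears in the directory the command was run from (normally the project root) and contains all the scraped quotes.</p>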
</li>
</ol>
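<p>A JSON array written by hand is only valid once its closing bracket has been written, so an interrupted crawl leaves an unreadable file. One common alternative is the JSON Lines format, with one object per line. The sketch below is a hypothetical variant of the storage pipeline: the class name <code>JsonLinesWriterPipeline</code> and its <code>path</code> argument are assumptions for illustration, and the demonstration at the bottom feeds it plain dicts so that it runs without Scrapy.</p>

```python
import json
import os
import tempfile


class JsonLinesWriterPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="quotes.jl"):
        # `path` is a hypothetical parameter for this sketch
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Stand-alone demonstration with plain dicts instead of a running spider
demo_path = os.path.join(tempfile.mkdtemp(), "quotes.jl")
pipeline = JsonLinesWriterPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"text": "Hello", "author": "A", "tags": ["x"]}, spider=None)
pipeline.process_item({"text": "World", "author": "B", "tags": []}, spider=None)
pipeline.close_spider(spider=None)

# Every line is a complete JSON document and can be parsed on its own
with open(demo_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

<p>Because each line stands on its own, the lines written before a crash remain readable.</p>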
<h4>6.4 Item, Pipeline, and Settings in Detail</h4>
<ul>
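<li><strong>Items (<code>items.py</code>)</strong>: define the structure and fields of the data to extract; they act as the "data model".</li>
<li><strong>Spiders (<code>spiders/xxx.py</code>)</strong>: each spider file corresponds to one crawling task and declares <code>start_urls</code>, <code>allowed_domains</code>, the <code>parse()</code> callback, and so on. You can define additional callbacks to parse different kinds of pages.</li>
<li><strong>Pipelines (<code>pipelines.py</code>)</strong>: process the Items returned by a Spider. Typical tasks are data cleaning (deduplication, formatting), storage (writing JSON/CSV, inserting into a database), and downloading attachments.</li>
<li><strong>Settings (<code>settings.py</code>)</strong>: the global configuration file, covering concurrency (<code>CONCURRENT_REQUESTS</code>), download delay (<code>DOWNLOAD_DELAY</code>), middleware and pipeline registration, the User-Agent, and more.</li>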
</ul>
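<p>As an illustration of the cleaning step that Pipelines handle, the sketch below drops items whose <code>text</code> field has already been seen. In a real Scrapy project you would raise <code>scrapy.exceptions.DropItem</code>; here a local stand-in exception with the same name is defined so that the example runs without Scrapy installed. The class name <code>DuplicatesPipeline</code> and the choice of <code>text</code> as the deduplication key are assumptions.</p>

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Normalize before comparing so that stray whitespace does not
        # defeat the duplicate check
        key = item["text"].strip()
        if key in self.seen:
            raise DropItem(f"duplicate item: {key!r}")
        self.seen.add(key)
        item["text"] = key
        return item


# Stand-alone demonstration: feed three items, the second is a duplicate
pipeline = DuplicatesPipeline()
items = [{"text": " A quote "}, {"text": "A quote"}, {"text": "Another"}]
kept = []
for raw in items:
    try:
        kept.append(pipeline.process_item(raw, spider=None))
    except DropItem:
        pass  # Scrapy itself would log the drop and skip later pipelines

print(kept)
```

<p>An in-memory set is enough for a single run; deduplicating across runs or machines needs shared storage instead.</p>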
<p>Example of common settings:</p>
<pre><code class="prism language-python"># settings.py (excerpt)
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Obey robots.txt
ROBOTSTXT_OBEY = True

# Number of concurrent requests (default: 16)
CONCURRENT_REQUESTS = 8

# Download delay in seconds, to avoid putting too much load on the target site
DOWNLOAD_DELAY = 1

# Configure the User-Agent
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
}

# Enable the pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}

# Enable or disable middlewares, extensions, pipelines, etc.
DOWNLOADER_MIDDLEWARES = {
    # 'myproject.middlewares.SomeDownloaderMiddleware': 543,
}

# Log level
LOG_LEVEL = 'INFO'</code></pre>
<h4>6.5 Interactive Debugging with Scrapy Shell</h4>
<ul>
<li>
<p>Scrapy provides the <code>scrapy shell &lt;url&gt;</code> command for quickly testing XPath and CSS selectors against a live page.</p>
<pre><code class="prism language-bash">scrapy shell <span class="token string">\'https://quotes.toscrape.com/\'</span></code></pre>
</li>
<li>
<p>Once inside the shell, you can run:</p>
<pre><code class="prism language-python">&gt;&gt;&gt; response.status
200
&gt;&gt;&gt; response.css('div.quote span.text::text').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', ...]
&gt;&gt;&gt; response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall()</code></pre>
</li>
<li>
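<p>In the shell you can experiment and verify extraction logic quickly, which is far more efficient than writing a complete Spider and running it each time.</p>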
</li>
</ul>
<h4>6.6 Distributed and Multithreaded Crawling: Scrapy Concurrency Settings</h4>
<ul>
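<li><strong>Concurrent requests</strong>: set <code>CONCURRENT_REQUESTS</code> in <code>settings.py</code> (default 16);</li>
<li><strong>Per-domain concurrency</strong>: <code>CONCURRENT_REQUESTS_PER_DOMAIN</code> (default 8);</li>
<li><strong>Per-IP concurrency</strong>: <code>CONCURRENT_REQUESTS_PER_IP</code> (default 0, meaning disabled; when non-zero it is used instead of the per-domain limit);</li>
<li><strong>Download delay</strong>: <code>DOWNLOAD_DELAY</code> (default 0);</li>
<li><strong>Automatic throttling</strong>: <code>AUTOTHROTTLE_ENABLED = True</code>, together with <code>AUTOTHROTTLE_START_DELAY</code>, <code>AUTOTHROTTLE_MAX_DELAY</code>, and related settings.</li>
<li><strong>Parallel requests</strong>: Scrapy is built on the Twisted asynchronous networking library, so it keeps many requests in flight at once without threads, and a single machine can comfortably work through thousands of requests.</li>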
</ul>
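<p>Put together, a conservative configuration might look like the following <code>settings.py</code> fragment. The setting names are Scrapy's own, but every value here is an illustrative assumption rather than a recommendation; tune them for the target site.</p>

```python
# settings.py (fragment): illustrative values only

# Global cap on requests in flight
CONCURRENT_REQUESTS = 8
# Cap on simultaneous requests to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Base delay in seconds between requests to the same site
DOWNLOAD_DELAY = 0.5

# AutoThrottle adjusts the delay based on observed response latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site
```

<p>With AutoThrottle enabled, <code>DOWNLOAD_DELAY</code> acts as the lower bound for the delay that Scrapy computes.</p>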
<h4>6.7 Scrapy Middleware and Extensions (Downloader Middleware, Downloader Handler)</h4>
<ul>
<li>
<p><strong>Downloader Middleware</strong>: sits between the Scrapy engine and the downloader and can inspect or alter requests and responses. It is commonly used to:</p>
<ul>
<li>set the User-Agent or proxy dynamically;</li>
<li>intercept and modify request or response headers;</li>
<li>handle retries, redirects, and similar concerns.</li>
</ul>
</li>
<li>
<p><strong>Example: a random User-Agent middleware</strong></p>
<pre><code class="prism language-python"># myproject/middlewares.py
import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agents=crawler.settings.get('USER_AGENTS_LIST')
        )

    def process_request(self, request, spider):
        ua = random.choice(self.user_agents)
        # Assign directly: setdefault() would leave in place a User-Agent
        # that was already filled in from DEFAULT_REQUEST_HEADERS
        request.headers['User-Agent'] = ua</code></pre>
<p>Then configure it in <code>settings.py</code>:</p>
<pre><code class="prism language-python">USER_AGENTS_LIST = [
    'Mozilla/5.0 ... Chrome/100.0 ...',
    'Mozilla/5.0 ... Firefox/110.0 ...',
    # more User-Agent strings
]

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    # Disable the built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}</code></pre>
</li>
<li>
<p><strong>Downloader Handler</strong>: a lower-level interface that rarely needs customizing; Scrapy already ships handlers such as <code>HTTPDownloadHandler</code> and <code>S3DownloadHandler</code>.</p>
</li>
</ul>
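<p>The same pattern applies to proxies. A minimal sketch, assuming a hypothetical custom setting named <code>PROXY_LIST</code>: Scrapy's built-in <code>HttpProxyMiddleware</code> honors <code>request.meta['proxy']</code>, so a custom middleware only needs to fill in that key. The fake crawler and request objects at the bottom exist purely so the sketch can be run without Scrapy.</p>

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, e.g.
        # PROXY_LIST = ['http://127.0.0.1:8080', ...]
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Respect a proxy that the spider set explicitly on this request
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request
        return None


# Stand-ins for Scrapy's crawler and request objects (demo only)
fake_settings = SimpleNamespace(getlist=lambda name: ["http://127.0.0.1:8080"])
fake_crawler = SimpleNamespace(settings=fake_settings)
fake_request = SimpleNamespace(meta={})

mw = RandomProxyMiddleware.from_crawler(fake_crawler)
mw.process_request(fake_request, spider=None)
print(fake_request.meta)
```

<p>In a real project the class would be registered in <code>DOWNLOADER_MIDDLEWARES</code> with an order value below 750, the default position of <code>HttpProxyMiddleware</code>, so that it runs first.</p>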
<hr />
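<h3>7. Scraping Dynamic Content: Selenium and Playwright</h3>
<p>When a page's content is rendered dynamically by JavaScript, the HTML fetched with plain <code>requests</code> or Scrapy often lacks the data that is finally shown to the user. In that case a browser automation tool can load the page the way a real browser does, and you extract the content after rendering.</p>
<h4>7.1 Why Browser Automation?</h4>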
<ul>
<li>
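<p>Many modern sites, single-page applications (SPAs) in particular, are built with frameworks such as React, Vue, or Angular. They fetch data through AJAX or API calls and render it on the front end, so requesting the URL directly returns only an empty shell or the framework's bootstrap code.</p>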
</li>
<li>
<p>Browser automation can:</p>
<ol>
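<li>launch a real or headless browser instance;</li>
<li>visit the page and wait for its JavaScript to finish executing;</li>
<li>obtain the fully rendered DOM, which is then handed to a parsing library for extraction.</li>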
</ol>
</li>
</ul>
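<p>The problem can be seen in a small, made-up response body: the raw HTML contains no rendered quote blocks, only an empty container and a script. The sketch below uses just the standard library; <code>raw_html</code>, the <code>var data</code> variable, and the <code>QuoteCounter</code> class are all assumptions for illustration. Some pages happen to embed their data as JSON in a script tag like this, in which case it can be read without a browser; many do not, and then browser automation is the practical route.</p>

```python
import json
import re
from html.parser import HTMLParser

# Made-up response body of a JavaScript-rendered page: the container is
# empty and the data lives in a script variable
raw_html = """
<html><body>
  <div id="app"></div>
  <script>
    var data = [{"text": "Hello", "author": "A"}, {"text": "World", "author": "B"}];
  </script>
</body></html>
"""


class QuoteCounter(HTMLParser):
    """Count <div class="quote"> elements present in static HTML."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


counter = QuoteCounter()
counter.feed(raw_html)
print("quote blocks in raw HTML:", counter.count)

# If the data is embedded as JSON, it can sometimes be read directly
match = re.search(r"var data = (\[.*?\]);", raw_html, re.S)
embedded = json.loads(match.group(1)) if match else []
print([q["author"] for q in embedded])
```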
<h4>7.2 Selenium Basics</h4>
<ol>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip <span class="token function">install</span> selenium</code></pre>
</li>
<li>
<p><strong>Download a WebDriver</strong> (Chrome is used as the example):</p>
<ul>
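<li>Go to the ChromeDriver download page and download the <code>chromedriver</code> build that matches your local Chrome version.</li>
<li>Put <code>chromedriver</code> on the system PATH, or specify its path in code. With Selenium 4.6 or later, the bundled Selenium Manager can usually locate or download a matching driver automatically, so this manual step is often unnecessary.</li>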
</ul>
</li>
<li>
<p><strong>Example: scraping a dynamically rendered page</strong></p>
<pre><code class="prism language-python">import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService

# 1. Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # run without a visible window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')

# 2. Point to chromedriver explicitly, or put it on PATH
service = ChromeService(executable_path='path/to/chromedriver')

# 3. Create the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # 4. Open the page
    driver.get('https://quotes.toscrape.com/js/')  # a JavaScript-rendered demo page

    # 5. Wait for rendering; time.sleep is the simplest way
    #    (explicit or implicit waits are preferable)
    time.sleep(2)

    # 6. Grab the rendered HTML
    html = driver.page_source

    # 7. Hand it to BeautifulSoup or lxml for parsing
    soup = BeautifulSoup(html, 'lxml')
    # BeautifulSoup takes CSS selectors through select(), not css()
    for quote in soup.select('div.quote'):
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        print(text, author)
finally:
    driver.quit()</code></pre>
</li>
<li>
<p><strong>Explicit and implicit waits</strong></p>
<ul>
<li>
<p><strong>Implicit wait</strong>: <code>driver.implicitly_wait(10)</code> makes every element lookup wait up to 10 seconds for the element to appear;</p>
</li>
<li>
<p><strong>Explicit wait</strong>: use <code>WebDriverWait</code> together with <code>expected_conditions</code>, for example:</p>
<pre><code class="prism language-python">from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the first quote block to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.quote'))
)</code></pre>
</li>
</ul>
</li>
</ol>
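<p>Conceptually, an explicit wait polls a condition until it returns a truthy value or a timeout expires. The helper below is a stand-alone sketch of that idea using only the standard library. <code>wait_until</code>, its parameters, and the fake <code>element_is_present</code> condition are hypothetical names, and this is not Selenium's implementation.</p>

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Demo: a fake "page" whose element only appears on the third check
state = {"checks": 0}


def element_is_present():
    state["checks"] += 1
    return "div.quote" if state["checks"] >= 3 else None


found = wait_until(element_is_present, timeout=2.0, poll_interval=0.01)
print(found, state["checks"])
```

<p>Unlike a fixed <code>time.sleep(2)</code>, polling returns as soon as the condition holds and fails loudly when it never does.</p>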
<h4>7.3 Playwright for Python (Faster and Lighter)</h4>
<ul>
<li>
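<p><strong>Playwright</strong>: a cross-browser automation library maintained by Microsoft and created by engineers who previously worked on Puppeteer. It supports Chromium, Firefox, and WebKit, and does not require downloading a separate WebDriver.</p>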
</li>
<li>
<p><strong>Advantages</strong>: fast startup, a concise API, and more flexible concurrency control.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip install playwright
# Install the browser binaries (only needed the first time)
playwright install</code></pre>
</li>
<li>
<p><strong>Example: scraping dynamic content</strong></p>
<pre><code class="prism language-python">import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://quotes.toscrape.com/js/')
        # Optional: wait for a specific element to finish loading
        await page.wait_for_selector('div.quote')
        content = await page.content()  # rendered HTML
        await browser.close()

    # Hand the HTML to BeautifulSoup for parsing
    soup = BeautifulSoup(content, 'lxml')
    for quote in soup.select('div.quote'):
        text = quote.select_one('span.text').get_text()
        author = quote.select_one('small.author').get_text()
        print(text, author)


if __name__ == '__main__':
    asyncio.run(main())</code></pre>
</li>
<li>
<p><strong>Synchronous Playwright</strong><br /> If you would rather not write async code, use <code>sync_api</code> instead:</p>
<pre><code class="prism language-python">from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/js/')
        # Wait for the JS-rendered quotes to appear
        page.wait_for_selector('div.quote')
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, 'lxml')
    for quote in soup.select('div.quote'):
        text = quote.select_one('span.text').get_text()
        author = quote.select_one('small.author').get_text()
        print(text, author)


if __name__ == '__main__':
    main()</code></pre>
</li>
</ul>
<h4>7.4 Headless mode and performance tuning</h4>
<ul>
<li>
<p><strong>Headless mode</strong>: environments without a graphical display, such as Linux servers, require the <code>--headless</code> flag; on macOS/Windows it can also shorten startup time.</p>
</li>
<li>
<p><strong>Limiting resources</strong>: launch arguments can reduce resource usage, for example:</p>
<ul>
<li>Chrome: <code>chrome_options.add_argument('--disable-gpu')</code>, plus <code>--no-sandbox</code> and <code>--disable-dev-shm-usage</code>;</li>
<li>Playwright: <code>browser = await p.chromium.launch(headless=True, args=['--disable-gpu', '--no-sandbox'])</code>.</li>
</ul>
</li>
<li>
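<p><strong>Avoid unnecessary rendering</strong>: if you only need the raw data, inspect the page's XHR requests and call the backend API directly instead of launching a full browser.</p>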
</li>
</ul>
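<p>As a rough sketch of that last point: when the browser's Network tab shows a page loading its data from a JSON endpoint, a plain HTTP client can fetch it without any browser. The endpoint URL and the payload shape (<code>quotes</code>, <code>text</code>, <code>author.name</code>) below are assumptions for illustration, and <code>extract_quotes</code> and <code>fetch_page</code> are hypothetical helper names; check the real XHR response before relying on any of them.</p>

```python
import json
from urllib.request import Request, urlopen

# Assumed JSON endpoint, as it might appear in the browser's Network (XHR) tab.
API_URL = "https://quotes.toscrape.com/api/quotes?page={page}"


def extract_quotes(payload):
    """Return (text, author) pairs from one decoded JSON page.

    Assumes a payload shaped like
    {"quotes": [{"text": ..., "author": {"name": ...}}, ...]}.
    """
    pairs = []
    for item in payload.get("quotes", []):
        author = item.get("author") or {}
        pairs.append((item.get("text", ""), author.get("name", "N/A")))
    return pairs


def fetch_page(page, timeout=10):
    """Download and decode one page of the (assumed) JSON API."""
    request = Request(API_URL.format(page=page), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline demonstration with a hand-written sample payload (no network needed).
sample = {
    "quotes": [
        {"text": "Sample quote.", "author": {"name": "Sample Author"}},
        {"text": "No author here."},
    ],
}
print(extract_quotes(sample))
```

<p>Against a live site you would call <code>extract_quotes(fetch_page(1))</code>; keeping the parsing separate from the download makes it easy to test on saved responses.</p>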
<h4>7.5 Combining Selenium/Playwright with BeautifulSoup</h4>
<p>The usual workflow:</p>
<ol>
<li>Use Selenium/Playwright to obtain the rendered HTML via <code>page_source</code> or <code>content()</code>;</li>
<li>Parse that HTML with BeautifulSoup/lxml to extract the data.</li>
</ol>
<p>A combined example:</p>
<pre><code class="prism language-python">from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_argument('--headless')
service = ChromeService('path/to/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.get('https://example.com/dynamic-page')
    # implicitly_wait() only affects find_element calls, so wait explicitly
    # for the rendered content before reading page_source
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.article'))
    )
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    # Extract whatever fields you need
    for item in soup.select('div.article'):
        title = item.select_one('h1').get_text()
        content = item.select_one('div.content').get_text(strip=True)
        print(title, content)
finally:
    driver.quit()</code></pre>
<hr />
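<h3>8. Asynchronous crawlers: aiohttp + asyncio and HTTPX</h3>
<p>When you need to fetch thousands or even tens of thousands of URLs, synchronous, blocking <code>requests</code> becomes inefficient. Python's built-in <code>asyncio</code> coroutines, combined with <code>aiohttp</code> or the async mode of <code>httpx</code>, can greatly improve concurrency.</p>
<h4>8.1 Synchronous vs. asynchronous: how the speedup works</h4>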
<ul>
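<li><strong>Synchronous (blocking)</strong>: each request must finish before the next one starts.</li>
<li><strong>Asynchronous (non-blocking)</strong>: after sending a request, the program can switch to other tasks right away; waiting on network I/O does not block the thread.</li>
<li>For I/O-bound crawlers, async code can raise throughput significantly.</li>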
</ul>
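<p>The difference can be seen without touching the network. The sketch below is only an illustration: it uses <code>time.sleep</code> and <code>asyncio.sleep</code> as stand-ins for network latency, and the function names are hypothetical. Five simulated requests run one after another take roughly five times the delay, while the concurrent version takes roughly one delay.</p>

```python
import asyncio
import time


async def fake_request(delay):
    """Stand-in for a network call: sleeps without blocking the event loop."""
    await asyncio.sleep(delay)
    return delay


def run_blocking(n, delay):
    """Issue n simulated requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(delay)  # blocks the whole thread, like a synchronous request
    return time.perf_counter() - start


async def run_concurrent(n, delay):
    """Issue n simulated requests concurrently; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(delay) for _ in range(n)))
    return time.perf_counter() - start


blocking_time = run_blocking(5, 0.1)
concurrent_time = asyncio.run(run_concurrent(5, 0.1))
print(f"blocking: {blocking_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```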
<h4>8.2 Getting started with aiohttp</h4>
<ol>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip <span class="token function">install</span> aiohttp</code></pre>
</li>
<li>
<p><strong>Concurrent fetching with asyncio + aiohttp</strong></p>
<pre><code class="prism language-python">import asyncio

import aiohttp
from bs4 import BeautifulSoup

# Total time budget per request; aiohttp expects a ClientTimeout object
TIMEOUT = aiohttp.ClientTimeout(total=10)


async def fetch(session, url):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            return await response.text()
    except Exception as e:
        print(f'Failed to fetch {url}: {e}')
        return None


def parse(html, url):
    if not html:
        return
    soup = BeautifulSoup(html, 'lxml')
    title_tag = soup.find('title')
    title = title_tag.get_text(strip=True) if title_tag else 'N/A'
    print(f'URL: {url}, Title: {title}')


async def main(urls):
    # The connector caps concurrent connections so we don't open too many TCP sockets
    conn = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        # gather waits for every fetch to finish
        htmls = await asyncio.gather(*tasks)

    # Parse the pages one by one
    for html, url in zip(htmls, urls):
        parse(html, url)


if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
    asyncio.run(main(urls))</code></pre>
</li>
<li>
<p><strong>Notes</strong>:</p>
<ul>
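<li><code>aiohttp.TCPConnector(limit=50)</code> caps concurrent connections at 50, which lowers the risk of being blocked by the server for opening too many connections in a short time.</li>
<li><code>asyncio.create_task</code> creates concurrent Tasks and hands them to the event loop for scheduling.</li>
<li><code>await asyncio.gather(*tasks)</code> waits for all tasks to finish and returns their results in the order the tasks were passed in.</li>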
</ul>
</li>
</ol>
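<p>One detail worth checking is why <code>zip(htmls, urls)</code> is safe after <code>gather</code>. The small sketch below uses simulated downloads (the <code>fake_fetch</code> helper and the URL labels are made up for the illustration) that finish in a different order than they were started; the results still line up with the inputs.</p>

```python
import asyncio


async def fake_fetch(url, delay):
    """Simulated download: waits `delay` seconds, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl():
    urls = ["u1", "u2", "u3"]
    delays = [0.03, 0.01, 0.02]  # u2 finishes first, u1 last
    tasks = [asyncio.create_task(fake_fetch(u, d)) for u, d in zip(urls, delays)]
    # gather returns results in the order the tasks were passed in,
    # not the order they finished, so zip(urls, results) stays aligned.
    results = await asyncio.gather(*tasks)
    return list(zip(urls, results))


pairs = asyncio.run(crawl())
print(pairs)
```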
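<h4>8.3 Controlling concurrency with an asyncio coroutine pool</h4>
<p>For finer control over how many fetch and parse tasks run at once, use <code>asyncio.Semaphore</code> or a third-party coroutine-pool library (such as aiomultiprocess or aiojobs) to cap concurrency.</p>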
<pre><code class="prism language-python">import asyncio

import aiohttp
from bs4 import BeautifulSoup

MAX_CONCURRENCY = 20  # at most 20 requests in flight at once
TIMEOUT = aiohttp.ClientTimeout(total=10)


async def fetch_with_sem(semaphore, session, url):
    async with semaphore:
        try:
            async with session.get(url, timeout=TIMEOUT) as resp:
                return await resp.text()
        except Exception as e:
            print(f'Error fetching {url}: {e}')
            return None


async def main(urls):
    # Create the semaphore inside the running loop; on Python 3.9 and earlier
    # a module-level Semaphore can end up bound to a different event loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch_with_sem(semaphore, session, url)) for url in urls]
        results = await asyncio.gather(*tasks)

    for html, url in zip(results, urls):
        if html:
            title_tag = BeautifulSoup(html, 'lxml').find('title')
            title = title_tag.get_text(strip=True) if title_tag else 'N/A'
            print(url, title)


if __name__ == '__main__':
    sample_urls = [f'https://example.com/page/{i}' for i in range(1, 51)]
    asyncio.run(main(sample_urls))</code></pre>
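<p>To confirm that a semaphore really bounds concurrency, it can help to measure it. The sketch below is an illustration with simulated work only (<code>worker</code> and <code>measure_peak</code> are hypothetical names): it starts 20 coroutines under a limit of 5 and records the highest number running at the same moment.</p>

```python
import asyncio


async def worker(sem, state):
    """Simulated fetch that records how many workers are running at once."""
    async with sem:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for network I/O
        state["running"] -= 1


async def measure_peak(total, limit):
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(total)))
    return state["peak"]


peak = asyncio.run(measure_peak(total=20, limit=5))
print("peak concurrency:", peak)
```

<p>The recorded peak never exceeds the limit, no matter how many tasks are queued.</p>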
<h4>8.4 HTTPX: an async-capable successor to Requests</h4>
<ul>
<li>
<p><strong>HTTPX</strong>: developed by the Encode team, with an API very close to <code>requests</code> and support for both synchronous and asynchronous use.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip <span class="token function">install</span> httpx</code></pre>
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch(client, url):
    try:
        resp = await client.get(url, timeout=10.0)
        resp.raise_for_status()
        return resp.text
    except Exception as e:
        print(f'Error {url}: {e}')
        return None


async def main(urls):
    limits = httpx.Limits(max_connections=50)
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = [asyncio.create_task(fetch(client, url)) for url in urls]
        # as_completed yields each result as soon as its request finishes
        for coro in asyncio.as_completed(tasks):
            html = await coro
            if html:
                title_tag = BeautifulSoup(html, 'lxml').find('title')
                title = title_tag.get_text(strip=True) if title_tag else 'N/A'
                print('Title:', title)


if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
    asyncio.run(main(urls))</code></pre>
</li>
<li>
<p>Its <code>requests</code>-compatible API (<code>.get()</code>, <code>.post()</code>, <code>.json()</code>, <code>.text</code>, and so on) makes it much easier to pick up.</p>
</li>
</ul>
<h4>8.5 Using a parsing library in async code (aiohttp + lxml)</h4>
<pre><code class="prism language-python">import asyncio

import aiohttp
from lxml import etree

TIMEOUT = aiohttp.ClientTimeout(total=10)


async def fetch_and_parse(session, url):
    try:
        async with session.get(url, timeout=TIMEOUT) as resp:
            text = await resp.text()
        tree = etree.HTML(text)
        # Take the first message, if there is one
        msgs = tree.xpath('//div[@class="msg"]/text()')
        msg = msgs[0].strip() if msgs else None
        print(url, msg)
    except Exception as e:
        print(f'Error fetching {url}: {e}')


async def main(urls):
    conn = aiohttp.TCPConnector(limit=30)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch_and_parse(session, url) for url in urls]
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    url_list = [f'https://example.com/messages/{i}' for i in range(1, 51)]
    asyncio.run(main(url_list))</code></pre>
<hr />
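<h3>9. Data Storage and Deduplication</h3>
<p>The end goal of a crawler is to collect and keep valuable data, so choosing a suitable storage backend and deduplication mechanism matters.</p>
<h4>9.1 Local Files: CSV, JSON, SQLite</h4>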
<ol>
<li>
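<p><strong>CSV/JSON</strong>:</p>
<ul>
<li>Suited to one-off jobs with small data volumes and loose requirements on data structure.</li>
<li>Readable and writable with the Python standard library alone (the <code>csv</code> and <code>json</code> modules).</li>
</ul>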
</li>
<li>
<p><strong>SQLite</strong>:</p>
<ul>
<li>
<p>A lightweight embedded database; no separate database server has to be deployed.</p>
</li>
<li>
<p>Suited to small and medium projects, for example tens of thousands of records.</p>
</li>
<li>
<p>Example:</p>
<pre><code class="prism language-python">import sqlite3

conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute(
    'CREATE TABLE IF NOT EXISTS items '
    '(id INTEGER PRIMARY KEY, title TEXT, url TEXT UNIQUE)'
)

data = [('Title 1', 'https://a.com/1'), ('Title 2', 'https://a.com/2')]
for title, url in data:
    try:
        cursor.execute('INSERT INTO items (title, url) VALUES (?, ?)', (title, url))
    except sqlite3.IntegrityError:
        pass  # duplicate url rejected by the UNIQUE constraint (deduplication)

conn.commit()
conn.close()</code></pre>
</li>
</ul>
</li>
</ol>
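<p>The CSV/JSON item above says the standard library is enough but shows no code, so here is a minimal sketch. The helper names (<code>save_csv</code>, <code>save_json</code>, <code>load_csv</code>, <code>load_json</code>), the field names, and the temporary output directory are illustrative assumptions, not part of the original tutorial.</p>

```python
import csv
import json
import os
import tempfile


def save_csv(rows, path, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path):
    """Read a CSV file back into a list of dicts (all values are strings)."""
    with open(path, newline='', encoding='utf-8') as f:
        return [dict(row) for row in csv.DictReader(f)]


def save_json(rows, path):
    """Write a list of dicts to a JSON file, keeping non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def main():
    # Hypothetical scraped items, used only for this demonstration
    items = [
        {'title': 'Title 1', 'url': 'https://a.com/1'},
        {'title': 'Title 2', 'url': 'https://a.com/2'},
    ]
    out_dir = tempfile.mkdtemp()
    csv_path = os.path.join(out_dir, 'items.csv')
    json_path = os.path.join(out_dir, 'items.json')
    save_csv(items, csv_path, fieldnames=['title', 'url'])
    save_json(items, json_path)
    print('Wrote', csv_path, 'and', json_path)


if __name__ == '__main__':
    main()
```

<p>CSV keeps every value as text and needs a fixed column list, while JSON preserves nested fields such as tag lists; that difference is usually what decides between the two.</p>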
<h4>9.2 Relational Databases: MySQL, PostgreSQL</h4>
<ul>
<li>
<p><strong>Advantages</strong>: suited to large-scale storage, with the full query power of SQL, which makes later analysis and statistics easier.</p>
</li>
<li>
<p><strong>Installation</strong>: install the database server first (MySQL, MariaDB, or PostgreSQL), then install the driver in Python:</p>
<pre><code class="prism language-bash">pip install pymysql   # MySQL
pip install psycopg2  # PostgreSQL</code></pre>
</li>
<li>
<p><strong>Example (MySQL)</strong>:</p>
<pre><code class="prism language-python">import pymysql

conn = pymysql.connect(
    host='localhost',
    user='root',
    password='root',
    database='spider_db',
    charset='utf8mb4',
)
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS articles (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        url VARCHAR(255) UNIQUE
    ) CHARACTER SET utf8mb4;
''')

data = [('Title 1', 'https://a.com/1'), ('Title 2', 'https://a.com/2')]
for title, url in data:
    try:
        cursor.execute('INSERT INTO articles (title, url) VALUES (%s, %s)', (title, url))
    except pymysql.err.IntegrityError:
        pass  # duplicate url rejected by the UNIQUE constraint

conn.commit()
conn.close()</code></pre>
</li>
</ul>
<h4>9.3 NoSQL Storage: MongoDB</h4>
<ul>
<li>
<p><strong>Advantages</strong>: a document database that handles semi-structured JSON data well and can store records whose fields differ from one another.</p>
</li>
<li>
<p><strong>Installation and driver</strong>:</p>
<ul>
<li>Install MongoDB locally or use a cloud service;</li>
<li>Python driver: <code>pip install pymongo</code>.</li>
</ul>
</li>
<li>
<p><strong>Example</strong>:</p>
<pre><code class="prism language-python">from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['spider_db']
collection = db['articles']

# Insert or update, using url as the deduplication key
data = {'title': 'Title 1', 'url': 'https://a.com/1', 'tags': ['news', 'featured']}
collection.update_one({'url': data['url']}, {'$set': data}, upsert=True)</code></pre>
</li>
</ul>
<h4>9.4 Redis for Deduplication and Short-Term Caching</h4>
<ul>
<li>
<p><strong>Redis</strong>: an in-memory key-value store that handles very high request concurrency, well suited to fingerprint deduplication, short-term caching, and queues.</p>
</li>
<li>
<p><strong>Common strategies</strong>:</p>
<ol>
<li><strong>Bloom filter</strong>: once the number of URLs reaches the millions, a plain Python set consumes a lot of memory. A Bloom filter trades a small false-positive rate for a much smaller memory footprint when checking whether a URL has already been crawled. You can use <code>pybloom-live</code>, or build the filter inside Redis (for example with the RedisBloom module).</li>
<li><strong>Redis set</strong>: for small-scale deduplication, store the crawled URLs directly in a Redis set, as in the example below.</li>
</ol>
<pre><code class="prism language-python">import redis

r = redis.Redis(host='localhost', port=6379, db=0)
url = 'https://example.com/page/1'

# SADD returns 1 if the member was newly added and 0 if it already existed
if r.sadd('visited_urls', url):
    print('New URL, crawl it')
else:
    print('URL already seen, skip')</code></pre>
</li>
</ul>
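<p>To make the Bloom filter trade-off concrete, here is a toy implementation using only the standard library. The class name <code>SimpleBloomFilter</code>, the default bit-array size, and the number of hashes are illustrative assumptions; a real project would more likely rely on <code>pybloom-live</code> or RedisBloom as mentioned above.</p>

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter: a fixed-size bit array plus k derived hash positions."""

    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        # One byte holds 8 bits, so 1,000,000 bits need 125,000 bytes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by prefixing the item with a counter before hashing
        data = item.encode('utf-8')
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, 'big') + data).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == '__main__':
    seen = SimpleBloomFilter()
    seen.add('https://example.com/page/1')
    print('https://example.com/page/1' in seen)  # True: added items are never missed
    print(len(seen.bits), 'bytes used')
```

<p>The filter never reports an added URL as missing, but it can occasionally report an unseen URL as present, so a small fraction of pages may be skipped. Memory stays fixed at the size of the bit array no matter how many URLs are added.</p>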
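<h4>9.5 Deduplication Strategies: Fingerprints, Hashing, Bloom Filter</h4>
<ul>
<li>
<p><strong>Fingerprint</strong>: normalize the URL first (sort query parameters so that URLs differing only in parameter order become identical, and remove redundant slashes), then hash the normalized URL (for example with MD5 or SHA1) and compare against the hashes stored in a set.</p>
</li>
<li>
<p><strong>Bloom Filter</strong>: a probabilistic data structure that deduplicates efficiently with very little memory. It pays off for large URL sets, at the cost of a very small false-positive rate (an unvisited URL may be judged as already visited).</p>
</li>
<li>
<p><strong>Recommended libraries</strong>:</p>
<ul>
<li><code>pybloom-live</code>: a pure-Python Bloom filter library;</li>
<li><code>redis-py-bloom</code> or Redis's official <code>RedisBloom</code> module (the extension must be installed on the Redis server);</li>
<li>Scrapy's built-in <code>scrapy.dupefilters.RFPDupeFilter</code>, which keeps request fingerprints in an in-memory set and persists them to a file only when <code>JOBDIR</code> is set. Redis-backed fingerprint storage comes from Scrapy-Redis's own <code>RFPDupeFilter</code>, not from Scrapy itself.</li>
</ul>
</li>
</ul>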
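<p>The fingerprint strategy above can be sketched with the standard library alone. The function names and the specific normalization rules (lowercasing the host, collapsing repeated slashes, dropping a trailing slash and the fragment) are assumptions for illustration; in particular, treating <code>/a</code> and <code>/a/</code> as the same page is not safe on every site.</p>

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form so that equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Collapse repeated slashes; an empty path becomes the root "/"
    path = re.sub(r'/{2,}', '/', parts.path) or '/'
    # Assumption: the site serves the same page with or without a trailing slash
    if len(path) > 1 and path.endswith('/'):
        path = path.rstrip('/')
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 become identical
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so drop it
    return urlunsplit((scheme, netloc, path, query, ''))


def url_fingerprint(url):
    """Hash the normalized URL into a fixed-length hexadecimal fingerprint."""
    return hashlib.sha1(normalize_url(url).encode('utf-8')).hexdigest()


def is_new(url, seen):
    """Record the URL's fingerprint in `seen`; return True only the first time."""
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True


if __name__ == '__main__':
    seen = set()
    print(is_new('https://Example.com//news/?b=2&a=1#top', seen))  # True
    print(is_new('https://example.com/news?a=1&b=2', seen))        # False
```

<p>Storing the 40-character SHA1 digest instead of the raw URL keeps every entry the same size, which is also what makes the fingerprints easy to keep in a Redis set.</p>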
<hr />
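<h3>10. Distributed Crawlers: Scrapy-Redis and Distributed Scheduling</h3>
<p>When a single-machine crawler can no longer keep up with high-concurrency, large-scale crawling, the work has to be spread across several machines. Scrapy-Redis is a widely used third-party distributed solution for Scrapy.</p>
<h4>10.1 Why Go Distributed?</h4>
<ul>
<li><strong>Huge numbers of links</strong>: with millions or hundreds of millions of URLs to fetch, a single machine cannot finish in acceptable time, whether it uses processes, threads, or coroutines.</li>
<li><strong>Speed requirements</strong>: the full data set has to be collected in a shorter time.</li>
<li><strong>Fault tolerance and scaling</strong>: a distributed deployment lets you add or remove nodes and keep crawling when a machine fails.</li>
</ul>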
<h4>10.2 Scrapy-Redis: Overview and Installation</h4>
<ul>
<li>
<p><strong>Scrapy-Redis</strong>: a Scrapy extension that stores the request queue and dedup fingerprints in Redis, providing distributed scheduling, distributed deduplication, and shared data.</p>
</li>
<li>
<p><strong>Installation</strong>:</p>
<pre><code class="prism language-bash">pip install scrapy-redis</code></pre>
</li>
</ul>
<h4>10.3 Distributed Dedup Queue and Scheduling</h4>
<ol>
<li>
<p><strong>Integrate Scrapy-Redis into the Scrapy project</strong></p>
<ul>
<li>
<p>Edit <code>settings.py</code>:</p>
<pre><code class="prism language-python"># settings.py

# Use the Redis-backed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the request queue in Redis across restarts so an unfinished crawl can resume
SCHEDULER_PERSIST = True

# Deduplicate through Redis (replaces Scrapy's default RFPDupeFilter)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis connection URL
REDIS_URL = 'redis://:password@127.0.0.1:6379/0'

# Push items into Redis so other processes or pipelines can handle them
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Key patterns for the Redis item list and the start-URL list
REDIS_ITEMS_KEY = '%(spider)s:items'
REDIS_START_URLS_KEY = '%(name)s:start_urls'</code></pre>
</li>
</ul>
</li>
<li>
<p><strong>Modify the Spider</strong></p>
<ul>
<li>Inherit from <code>scrapy_redis.spiders.RedisSpider</code> or <code>RedisCrawlSpider</code>, and replace the original <code>start_urls</code> with seed URLs read from a Redis queue.</li>
</ul>
<pre><code class="prism language-python"># myproject/spiders/redis_quotes.py
from scrapy_redis.spiders import RedisSpider

from myproject.items import MyprojectItem


class RedisQuotesSpider(RedisSpider):
    name = 'redis_quotes'
    # Redis key that holds the start URLs
    redis_key = 'redis_quotes:start_urls'

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyprojectItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item

        # Follow the pagination link, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)</code></pre>
</li>
<li>
<p><strong>Push seed URLs into Redis</strong></p>
<ul>
<li>
<p>On the local or a remote machine, use <code>redis-cli</code> to push the seed URL onto the list:</p>
<pre><code class="prism language-bash">redis-cli lpush redis_quotes:start_urls "https://quotes.toscrape.com/"</code></pre>
</li>
</ul>
</li>
<li>
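<p><strong>Start the distributed crawler</strong></p>
<ul>
<li>
<p>Start the spider on several servers or in several terminals:</p>
<pre><code class="prism language-bash">scrapy crawl redis_quotes</code></pre>
</li>
<li>
<p>All instances pull URLs from the same Redis queue and deduplicate through Redis, so they do not repeat each other's work.</p>
</li>
</ul>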
</li>
</ol>
<h4>10.4 Multi-Machine Example</h4>
<ol>
<li>
<p>Provision several servers (A, B, C) that can all reach the same Redis instance.</p>
</li>
<li>
<p>On machine A, run:</p>
<pre><code class="prism language-bash">redis-server # start Redis (it can also run on a dedicated host)</code></pre>
</li>
<li>
<p>On A, B, and C, check out the complete Scrapy project and set <code>REDIS_URL</code> in <code>settings.py</code>.</p>
</li>
<li>
<p>From A or any other machine, push the seed URL into Redis:</p>
<pre><code class="prism language-bash">redis-cli -h A_ip -p 6379 lpush redis_quotes:start_urls "https://quotes.toscrape.com/"</code></pre>
</li>
<li>
<p>On A, B, and C, run:</p>
<pre><code class="prism language-bash">scrapy crawl redis_quotes</code></pre>
<ul>
<li>The three machines coordinate through Redis: each one pulls URLs from the shared queue, and the dedup fingerprints are maintained centrally in Redis.</li>
</ul>
</li>
<li>
<p>Collecting the data:</p>
<ul>
<li>Crawled items are stored in a Redis list by <code>RedisPipeline</code> (key: <code>redis_quotes:items</code>, following the <code>'%(spider)s:items'</code> pattern and the spider name <code>redis_quotes</code>);</li>
<li>A standalone script or another pipeline can then persist the data to a database or file.</li>
</ul>
</li>
</ol>
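<p>The last step above mentions a standalone script that persists items out of Redis. A minimal sketch follows, assuming the items in the list are JSON strings (check your <code>REDIS_ITEMS_SERIALIZER</code> setting) and carry the <code>text</code>, <code>author</code>, and <code>tags</code> fields from the spider. The function name <code>drain_items</code>, the SQLite table layout, and the batch size are hypothetical choices for illustration.</p>

```python
import json
import sqlite3


def drain_items(client, key, conn, batch_size=100):
    """Pop up to batch_size JSON items from a Redis list and store them in SQLite.

    `client` is any object exposing lpop(key), for example a redis.Redis instance.
    Returns the number of rows actually inserted (duplicates are skipped).
    """
    conn.execute(
        'CREATE TABLE IF NOT EXISTS quotes '
        '(text TEXT UNIQUE, author TEXT, tags TEXT)'
    )
    inserted = 0
    for _ in range(batch_size):
        raw = client.lpop(key)
        if raw is None:  # the list is empty
            break
        if isinstance(raw, bytes):  # redis-py returns bytes by default
            raw = raw.decode('utf-8')
        item = json.loads(raw)
        cur = conn.execute(
            'INSERT OR IGNORE INTO quotes (text, author, tags) VALUES (?, ?, ?)',
            (item.get('text'), item.get('author'), json.dumps(item.get('tags', []))),
        )
        inserted += cur.rowcount  # 1 if inserted, 0 if the UNIQUE constraint skipped it
    conn.commit()
    return inserted
```

<p>In a real deployment you would pass <code>redis.Redis.from_url(REDIS_URL)</code> as <code>client</code>, the items key produced by <code>REDIS_ITEMS_KEY</code> as <code>key</code>, and call the function in a loop or from a scheduled job. Taking the client as a parameter keeps the function easy to test without a running Redis server.</p>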
<hr />
<h3>11. Common Anti-Crawling Measures and Countermeasures</h3>
<h4>11.1 Rate Limiting and Request Header Spoofing</h4>
<ol>
<li>
<p><strong>Throttling request frequency (rate limiting)</strong></p>
<ul>
<li>
<p>Add a random or fixed delay between requests to the target site:</p>
<pre><code class="prism language-python">import random
import time

time.sleep(random.uniform(1, 3))  # wait a random 1 to 3 seconds</code></pre>
</li>
<li>
<p>In Scrapy, use settings such as <code>DOWNLOAD_DELAY</code> and <code>AUTOTHROTTLE_ENABLED</code>.</p>
</li>
</ul>
</li>
<li>
<p><strong>User-Agent spoofing</strong></p>
<ul>
<li>Rotate random User-Agent strings to imitate different browsers.</li>
<li>See Section 4.6 for a code example.</li>
</ul>
</li>
<li>
<p><strong>Referer, Accept-Language, Accept-Encoding, and other headers</strong></p>
<ul>
<li>
<p>Send the full set of headers a real browser would send:</p>
<pre><code class="prism language-python">import requests

url = 'https://example.com/page'  # placeholder target URL
headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Referer': 'https://example.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    # Advertise 'br' only if the brotli package is installed;
    # requests cannot decode Brotli responses without it
    'Accept-Encoding': 'gzip, deflate',
    # Include a Cookie if needed
    'Cookie': 'sessionid=xxx; other=yyy',
}
response = requests.get(url, headers=headers)</code></pre>
</li>
</ul>
</li>
</ol>
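<p>The fixed <code>time.sleep</code> call above delays every request equally, even when consecutive requests go to different sites. One possible refinement is a per-host throttle, sketched below. The class name <code>DomainThrottle</code> and its parameters are hypothetical; the injectable <code>clock</code> and <code>sleep</code> arguments exist only to make the class testable without real waiting.</p>

```python
import random
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a randomized minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, max_delay=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until the URL's host may be requested again; return seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        waited = max(0.0, ready_at - now)
        if waited:
            self._sleep(waited)
        # Schedule the next slot a random delay after this request starts
        delay = random.uniform(self.min_delay, self.max_delay)
        self._next_allowed[host] = max(now, ready_at) + delay
        return waited


if __name__ == '__main__':
    throttle = DomainThrottle(min_delay=1.0, max_delay=3.0)
    for page in ('https://example.com/a', 'https://example.com/b'):
        waited = throttle.wait(page)
        print(f'waited {waited:.2f}s before requesting {page}')
```

<p>Call <code>throttle.wait(url)</code> immediately before each <code>requests.get(url, ...)</code>. Requests to different hosts are not delayed by one another, while repeated requests to one host stay 1 to 3 seconds apart.</p>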
<h4>11.2 Login Authentication and Cookie Management</h4>
<ul>
<li>
<p><strong>Session object</strong>: in <code>requests</code>, <code>requests.Session()</code> manages cookies in one place.</p>
</li>
<li>
<p><strong>Simulated login flow</strong>:</p>
<ol>
<li>Send a <code>GET</code> request for the login page and extract the hidden token (such as a CSRF token);</li>
<li><code>POST</code> the username, password, and token to the login endpoint;</li>
<li>On success, the <code>session</code> holds the cookies, so later requests made through the same session stay logged in.</li>
</ol>
</li>
<li>
<p><strong>Crawling with cookies</strong>:</p>
<pre><code class="prism language-python">import requests
from bs4 import BeautifulSoup

session = requests.Session()

# First request: fetch the login page to obtain the CSRF token
login_page = session.get('https://example.com/login')

# Parse the hidden token with BeautifulSoup
soup = BeautifulSoup(login_page.text, 'lxml')
token = soup.find('input', {'name': 'csrf_token'})['value']

# Build the login form
data = {
    'username': 'yourname',
    'password': 'yourpwd',
    'csrf_token': token,
}

# Log in
session.post('https://example.com/login', data=data, headers={'User-Agent': '...'})

# Once logged in, reuse the session for pages that require authentication
profile = session.get('https://example.com/profile')
print(profile.text)</code></pre>
</li>
</ul>
<h4>11.3 CAPTCHA Recognition (Brief Introduction)</h4>
<ul>
<li>
<p><strong>Common CAPTCHA types</strong>:</p>
<ul>
<li>Image CAPTCHAs (distorted letters or digits);</li>
<li>Slider CAPTCHAs (jigsaw or drag);</li>
<li>Click CAPTCHAs (select specific images);</li>
<li>Behavioral analysis (human-vs-bot verification).</li>
</ul>
</li>
<li>
<p><strong>Common approaches</strong>:</p>
<ol>
<li>
<p><strong>Simple OCR</strong>: use <code>pytesseract</code> to recognize simple digit or letter CAPTCHAs. The success rate is low for heavily distorted images or images with many interference lines. Note that <code>pytesseract</code> is only a wrapper, so the Tesseract OCR engine itself must be installed separately.</p>
<pre><code class="prism language-bash">pip install pytesseract pillow</code></pre>
<pre><code class="prism language-python">from PIL import Image
import pytesseract

img = Image.open('captcha.png')
text = pytesseract.image_to_string(img).strip()
print('Recognized text:', text)</code></pre>
</li>
<li>
<p><strong>CAPTCHA-solving services or manual solving</strong>: when the CAPTCHA is too complex, call a third-party solving service API (such as Chaojiying (超级鹰) or Damatu (打码兔)): send the image to the service and receive the recognized result. Alternatively, have a person solve it.</p>
</li>
<li>
<p><strong>Bypassing via the real endpoint</strong>: some sites do not actually submit the CAPTCHA with the login request and only validate it in the front end. In that case you can capture the traffic, find the real login endpoint, and call it directly, skipping the CAPTCHA.</p>
</li>
</ol>
</li>
</ul>
<h4>11.4 Building and Rotating a Proxy IP Pool</h4>
<ol>
<li>
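<p><strong>Why use proxies</strong></p>
<ul>
<li>Too many requests from one IP in a short time easily gets that IP banned. A proxy pool lets you keep switching IPs, which lowers the request rate seen from any single IP.</li>
</ul>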
</li>
<li>
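<p><strong>Obtaining proxies</strong></p>
<ul>
<li><strong>Free proxies</strong>: publicly listed free proxy IPs, which are generally unstable and short-lived. A crawler can periodically collect candidates from free proxy sites (such as xicidaili or kuaidaili) and verify that each one works before use.</li>
<li><strong>Paid proxies</strong>: paid services such as Abuyun (阿布云) and Kuaidaili (快代理), which are more stable and more secure.</li>
</ul>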
</li>
<li>
<p><strong>A simple local proxy pool</strong> (using free proxies, for learning purposes only)</p>
<pre><code class="prism language-python">import random

import requests
from lxml import etree


def fetch_free_proxies():
    url = 'https://www.kuaidaili.com/free/inha/1/'
    headers = {'User-Agent': 'Mozilla/5.0 ...'}
    resp = requests.get(url, headers=headers, timeout=10)
    tree = etree.HTML(resp.text)
    proxies = []
    for row in tree.xpath('//table//tr')[1:]:  # skip the header row
        ip = row.xpath('./td[1]/text()')
        port = row.xpath('./td[2]/text()')
        if not ip or not port:
            continue  # row without the expected cells
        proxy = f'http://{ip[0]}:{port[0]}'
        # Quick check: keep the proxy only if a test request succeeds
        try:
            r = requests.get('https://httpbin.org/ip',
                             proxies={'http': proxy, 'https': proxy}, timeout=3)
            if r.status_code == 200:
                proxies.append(proxy)
        except requests.RequestException:
            continue
    return proxies


def get_random_proxy(proxies):
    return random.choice(proxies) if proxies else None


if __name__ == '__main__':
    proxy_list = fetch_free_proxies()
    print('Working proxies:', proxy_list)
    # Example use in an actual crawler:
    proxy = get_random_proxy(proxy_list)
    if proxy:
        resp = requests.get('https://example.com',
                            proxies={'http': proxy, 'https': proxy}, timeout=10)
        print(resp.status_code)</code></pre>
</li>
<li>
<p><strong>Configuring proxies in Scrapy</strong></p>
<ul>
<li>
<p>A simple setup in <code>settings.py</code>:</p>
<pre><code class="prism language-python"># settings.py
# Downloader middlewares (for a custom proxy pool or User-Agent, see the middleware example earlier)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'myproject.middlewares.RandomProxyMiddleware': 100,
}

# Proxy list
PROXY_LIST = [
    'http://ip1:port1',
    'http://ip2:port2',
    # ...
]</code></pre>
</li>
<li>
<p>A custom <code>RandomProxyMiddleware</code>:</p>
<pre><code class="prism language-python"># myproject/middlewares.py
import random


class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies or []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Leave the request untouched when no proxies are configured
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)</code></pre>
</li>
<li>
<p>With this in place, Scrapy picks a random proxy from <code>PROXY_LIST</code> for each request.</p>
</li>
</ul>
</li>
</ol>
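<p>Picking a proxy at random never retires one that has stopped working. The sketch below shows one possible way to track failures and drop bad proxies; the <code>ProxyPool</code> class, its method names, and the <code>max_failures</code> threshold are hypothetical and not part of Scrapy or of the code above.</p>

```python
import random


class ProxyPool:
    """Hypothetical helper: random proxy choice with failure tracking."""

    def __init__(self, proxies, max_failures=3):
        self.max_failures = max_failures
        # Map each proxy URL to its consecutive failure count
        self.failures = {proxy: 0 for proxy in proxies}

    def get(self):
        """Return a random live proxy, or None when the pool is empty."""
        if not self.failures:
            return None
        return random.choice(list(self.failures))

    def report_failure(self, proxy):
        """Count a failure and drop the proxy once it reaches the limit."""
        if proxy not in self.failures:
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            del self.failures[proxy]

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self.failures:
            self.failures[proxy] = 0


pool = ProxyPool(['http://ip1:port1', 'http://ip2:port2'], max_failures=2)
pool.report_failure('http://ip1:port1')
pool.report_failure('http://ip1:port1')  # second failure removes ip1
print(pool.get())  # only ip2 is left in the pool
```

<p>In a Scrapy project, a middleware could call <code>report_failure</code> from its <code>process_exception</code> hook and <code>report_success</code> from <code>process_response</code>; that wiring is left out here to keep the sketch free of external dependencies.</p>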
<hr />
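<h3>12. Complete Example: Crawling a News Site and Storing the Results in a Database</h3>
<p>This section walks through a complete Scrapy + MySQL example: crawling the headline news of a mock news site (for example, <code>https://news.example.com</code>) and storing each story's title, summary, and link in a MySQL database.</p>
<h4>12.1 Requirements Analysis</h4>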
<ol>
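<li><strong>Target data</strong>: news title, summary, article link, and publication time.</li>
<li><strong>Crawl scope</strong>: headline news on the home page (this example assumes a paginated layout; adjust as needed if the site loads content dynamically).</li>
<li><strong>Storage</strong>: a MySQL database, table <code>headline_news</code>, with columns <code>id, title, summary, url, pub_date</code>.</li>
<li><strong>Anti-blocking measures</strong>: random User-Agent, download delay, and simple IP masking through proxies.</li>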
</ol>
<h4>12.2 Full Implementation with Scrapy + MySQL</h4>
<ol>
<li>
<p><strong>Create the Scrapy project</strong></p>
<pre><code class="prism language-bash">scrapy startproject news_spider
cd news_spider</code></pre>
</li>
<li>
<p><strong>Install dependencies</strong></p>
<pre><code class="prism language-bash">pip <span class="token function">install</span> scrapy pymysql</code></pre>
</li>
<li>
<p><strong>Define the Item</strong> (<code>news_spider/items.py</code>)</p>
<pre><code class="prism language-python">import scrapy


class NewsSpiderItem(scrapy.Item):
    title = scrapy.Field()
    summary = scrapy.Field()
    url = scrapy.Field()
    pub_date = scrapy.Field()</code></pre>
</li>
<li>
<p><strong>Configure MySQL</strong> (<code>news_spider/settings.py</code>)</p>
<pre><code class="prism language-python"># Database settings
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_DB = 'news_db'
MYSQL_CHARSET = 'utf8mb4'

# Item Pipeline
ITEM_PIPELINES = {
    'news_spider.pipelines.MySQLPipeline': 300,
}

# Download settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8

USER_AGENTS_LIST = [
    'Mozilla/5.0 ... Chrome/100.0 ...',
    'Mozilla/5.0 ... Firefox/110.0 ...',
    # add more as needed
]

DOWNLOADER_MIDDLEWARES = {
    'news_spider.middlewares.RandomUserAgentMiddleware': 400,
    # Disable the built-in middleware so it does not set its own User-Agent
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}</code></pre>
</li>
<li>
<p><strong>Custom middleware: random User-Agent</strong> (<code>news_spider/middlewares.py</code>)</p>
<pre><code class="prism language-python">import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents or []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(user_agents=crawler.settings.getlist('USER_AGENTS_LIST'))

    def process_request(self, request, spider):
        # Assign directly (rather than setdefault) so that every request,
        # including retries, gets a freshly chosen User-Agent
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)</code></pre>
</li>
<li>
<p><strong>MySQL Pipeline</strong> (<code>news_spider/pipelines.py</code>)</p>
<pre><code class="prism language-python">import pymysql
from pymysql.err import IntegrityError


class MySQLPipeline:
    def open_spider(self, spider):
        # Connect to the database using the values from settings.py
        settings = spider.settings
        self.conn = pymysql.connect(
            host=settings.get('MYSQL_HOST'),
            port=settings.getint('MYSQL_PORT'),
            user=settings.get('MYSQL_USER'),
            password=settings.get('MYSQL_PASSWORD'),
            database=settings.get('MYSQL_DB'),
            charset=settings.get('MYSQL_CHARSET'),
            cursorclass=pymysql.cursors.DictCursor,
        )
        self.cursor = self.conn.cursor()
        # Create the table if it does not exist yet
        create_table_sql = """
        CREATE TABLE IF NOT EXISTS headline_news (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            summary TEXT,
            url VARCHAR(512) UNIQUE,
            pub_date DATETIME
        ) CHARACTER SET utf8mb4;
        """
        self.cursor.execute(create_table_sql)
        self.conn.commit()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        insert_sql = """
        INSERT INTO headline_news (title, summary, url, pub_date)
        VALUES (%s, %s, %s, %s)
        """
        try:
            self.cursor.execute(insert_sql, (
                item.get('title'),
                item.get('summary'),
                item.get('url'),
                item.get('pub_date'),
            ))
            self.conn.commit()
        except IntegrityError:
            # The URL is already stored: roll back and skip this item
            self.conn.rollback()
        return item</code></pre>
</li>
<li>
<p><strong>Write the Spider</strong> (<code>news_spider/spiders/news.py</code>)</p>
<pre><code class="prism language-python">from datetime import datetime

import scrapy

from news_spider.items import NewsSpiderItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['news.example.com']
    start_urls = ['https://news.example.com/']

    def parse(self, response):
        # Assumes headlines sit under &lt;div class="headline-list"&gt;,
        # one &lt;div class="item"&gt; per story
        for news in response.css('div.headline-list div.item'):
            href = news.css('a::attr(href)').get()
            if not href:
                continue  # no link, so there is no detail page to follow
            item = NewsSpiderItem()
            # get(default='') avoids an AttributeError when nothing matches
            item['title'] = news.css('h2.title::text').get(default='').strip()
            item['summary'] = news.css('p.summary::text').get(default='').strip()
            item['url'] = response.urljoin(href)
            # Raw text for now; parse_detail replaces it with a datetime
            item['pub_date'] = news.css('span.pub-date::text').get(default='').strip()
            yield scrapy.Request(
                url=item['url'],
                callback=self.parse_detail,
                meta={'item': item},
            )

        # Assumes pagination with the next-page link in &lt;a class="next-page" href="..."&gt;
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta['item']
        # The detail page may carry a more precise publication time
        pub_date = response.css('div.meta span.date::text').get(default='').strip()
        item['pub_date'] = self.parse_date(pub_date)
        yield item

    def parse_date(self, date_str):
        # Assumes date_str looks like '2025-05-30 14:30:00'
        try:
            return datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
        except ValueError:
            return None</code></pre>
</li>
<li>
<p><strong>Run the spider</strong></p>
<ul>
<li>
<p>Make sure the <code>news_db</code> database has been created in MySQL and that the username and password are correct.</p>
</li>
<li>
<p>From the project root, run:</p>
<pre><code class="prism language-bash">scrapy crawl news</code></pre>
</li>
<li>
<p>While the spider runs, the log shows crawl progress. Once it finishes, check the results in the <code>headline_news</code> table:</p>
<pre><code class="prism language-sql">SELECT * FROM headline_news LIMIT 10;</code></pre>
</li>
</ul>
</li>
</ol>
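<h4>12.3 Code Walkthrough and Common Q&amp;A</h4>
<ul>
<li>
<p><strong>Q: Why does <code>parse</code> issue a new Request to the detail page?</strong></p>
<ul>
<li>The home page shows only part of the data; some fields (such as the exact publication time, author, and body text) are available only on the detail page. The <code>meta</code> argument passes the fields already scraped to the next callback.</li>
</ul>
</li>
<li>
<p><strong>Q: How do I convert the string <code>'2025-05-30 14:30:00'</code> to a <code>datetime</code>?</strong></p>
<ul>
<li>Use <code>datetime.strptime</code> from the Python standard library with the matching format string. If the input format is inconsistent, clean it first with <code>strip()</code> or extract the date portion with a regular expression.</li>
</ul>
</li>
<li>
<p><strong>Q: What if the target site requires login or shows a CAPTCHA?</strong></p>
<ul>
<li>You can simulate the login in <code>start_requests</code> (using <code>requests</code> + <code>cookies</code>, or Selenium), collect the cookies after logging in, and then attach those cookies to the Scrapy requests.</li>
</ul>
</li>
<li>
<p><strong>Q: How do I handle a very large number of pages (thousands)?</strong></p>
<ul>
<li>Work out the URL pattern (for example <code>page=1,2,3...</code>) and use <code>for page in range(1, 1001): yield scrapy.Request(...)</code>. Throttle the crawl and rotate IPs to avoid being blocked.</li>
</ul>
</li>
<li>
<p><strong>Q: Why use a random User-Agent?</strong></p>
<ul>
<li>Sending the same User-Agent (or a library default) on every request makes traffic easy to flag. Rotating it lowers the chance that the site identifies the client as a crawler.</li>
</ul>
</li>
<li>
<p><strong>Q: How do I use proxies in Scrapy?</strong></p>
<ul>
<li>See Section 11.4: register your own <code>RandomProxyMiddleware</code> in <code>DOWNLOADER_MIDDLEWARES</code>, or use a library such as Scrapy-Proxy-Pool.</li>
</ul>
</li>
</ul>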
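<p>For the date-conversion question above, real pages often mix several date formats. The following is a minimal sketch that tries a list of candidate formats in turn; the <code>parse_date_flexible</code> name and the formats in <code>DATE_FORMATS</code> are assumptions for illustration and should be adapted to the target site.</p>

```python
from datetime import datetime

# Candidate formats are assumptions; adjust them to the target site
DATE_FORMATS = (
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%d %H:%M',
    '%Y/%m/%d %H:%M:%S',
    '%Y-%m-%d',
)


def parse_date_flexible(date_str):
    """Try each known format in turn; return None if none matches."""
    if not date_str:
        return None
    cleaned = date_str.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


print(parse_date_flexible(' 2025-05-30 14:30:00 '))  # 2025-05-30 14:30:00
print(parse_date_flexible('2025/05/30 08:05:00'))    # 2025-05-30 08:05:00
print(parse_date_flexible('yesterday'))              # None
```

<p>Returning <code>None</code> for unknown formats lets the pipeline store a NULL <code>pub_date</code> instead of failing on the whole item.</p>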
<hr />
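<h3>13. Commonly Used Third-Party Libraries for Python Crawlers (as of June 2025)</h3>
<p>The libraries below are grouped by category, each with a short description and typical use cases.</p>
<h4>13.1 Basic Requests and Parsing</h4>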
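<table>
<thead>
<tr><th>Library</th><th>Description</th><th>Typical use cases</th></tr>
</thead>
<tbody>
<tr><td><strong>requests</strong></td><td>Synchronous HTTP client with a simple API and a mature ecosystem</td><td>Most simple crawlers; form submission and cookie support</td></tr>
<tr><td><strong>httpx</strong></td><td>HTTP client supporting both sync and async, with a requests-like API</td><td>A good choice when you need async or more advanced features</td></tr>
<tr><td><strong>aiohttp</strong></td><td>asyncio-native HTTP client</td><td>High-concurrency, asynchronous crawlers</td></tr>
<tr><td><strong>urllib3</strong></td><td>Low-level HTTP client that requests is built on</td><td>Lower-level control and custom connection pool management</td></tr>
<tr><td><strong>BeautifulSoup (bs4)</strong></td><td>HTML/XML parsing; easy to learn and flexible</td><td>Beginners getting started quickly; parsing complex HTML</td></tr>
<tr><td><strong>lxml</strong></td><td>High-performance parser built on libxml2/libxslt, with XPath support</td><td>High-performance parsing of large volumes of data, with XPath extraction</td></tr>
<tr><td><strong>parsel</strong></td><td>The parsing library used by Scrapy; supports CSS and XPath selectors</td><td>Quick parsing inside Scrapy projects, or standalone outside them</td></tr>
<tr><td><strong>PyQuery</strong></td><td>jQuery-like parsing API built on lxml</td><td>Developers with a front-end background who prefer CSS selectors</td></tr>
<tr><td><strong>re (regular expressions)</strong></td><td>Python's built-in regex module, for pattern matching on simply structured text</td><td>Extracting emails, phone numbers, URLs, numbers, and other simple patterns</td></tr>
<tr><td><strong>html5lib</strong></td><td>The most tolerant parser (handles malformed HTML), but relatively slow</td><td>Parsing severely malformed HTML</td></tr>
</tbody>
</table>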
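<p>As a small illustration of the <code>re</code> row above, the sketch below pulls email addresses and URLs out of an HTML fragment. The two patterns and the sample HTML are simplified assumptions for demonstration; real-world emails and URLs have many more edge cases, and structured content is usually better handled by a proper HTML parser.</p>

```python
import re

# Deliberately simple patterns; they do not cover every valid email or URL
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
URL_RE = re.compile(r'https?://[^\s"<>]+')

html = '''
<p>Contact: <a href="mailto:editor@example.com">editor@example.com</a></p>
<p>More at <a href="https://news.example.com/page/2">page 2</a></p>
'''

# The address appears twice in the markup, so de-duplicate with a set
emails = sorted(set(EMAIL_RE.findall(html)))
urls = URL_RE.findall(html)

print(emails)  # ['editor@example.com']
print(urls)    # ['https://news.example.com/page/2']
```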
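<h4>13.2 Browser Automation</h4>
<table>
<thead>
<tr><th>Library</th><th>Description</th><th>Typical use cases</th></tr>
</thead>
<tbody>
<tr><td><strong>Selenium</strong></td><td>The most mature browser automation framework; supports Chrome, Firefox, Edge, and others</td><td>Simulating user actions (clicks, scrolling, form submission); scraping JS-rendered content</td></tr>
<tr><td><strong>Playwright</strong></td><td>From Microsoft, built by developers who previously worked on Puppeteer; clean API, multi-browser support</td><td>High-performance headless mode; offers both sync and async APIs</td></tr>
<tr><td><strong>Pyppeteer</strong></td><td>Unofficial Python port of Puppeteer</td><td>Node.js users moving to Python who want to get started quickly</td></tr>
<tr><td><strong>undetected-chromedriver</strong></td><td>Counters anti-bot checks by hiding Selenium's fingerprints</td><td>Stronger evasion of detection, especially against advanced anti-scraping systems</td></tr>
<tr><td><strong>Splash</strong></td><td>QtWebKit-based JavaScript rendering service, integrated with Scrapy through Scrapy-Splash</td><td>Combining Scrapy with dynamic rendering for batch asynchronous rendering</td></tr>
</tbody>
</table>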
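<h4>13.3 Asynchronous Crawling</h4>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>asyncio</strong></td>
<td>Python standard library; provides the event loop and coroutine foundation</td>
<td>Writing the main framework of an async crawler</td>
</tr>
<tr>
<td><strong>aiohttp</strong></td>
<td>HTTP client built on asyncio</td>
<td>High-concurrency fetching, paired with BeautifulSoup or lxml for parsing</td>
</tr>
<tr>
<td><strong>httpx</strong></td>
<td>Supports both sync and async; its interface is largely compatible with requests</td>
<td>Moving from requests to async code with minimal changes</td>
</tr>
<tr>
<td><strong>trio</strong></td>
<td>An alternative async framework built around structured concurrency; its ecosystem is comparatively small</td>
<td>Studying async principles in depth, or trying a different model</td>
</tr>
<tr>
<td><strong>curio</strong></td>
<td>Pure-Python async library that emphasizes simplicity</td>
<td>Studying how asynchronous I/O works</td>
</tr>
<tr>
<td><strong>aiofiles</strong></td>
<td>Asynchronous file operations</td>
<td>Reading or writing many files from async code</td>
</tr>
</tbody>
</table>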
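<p>To show how <code>asyncio</code> serves as the backbone of an async crawler, the sketch below caps the number of in-flight requests with a semaphore. <code>fake_fetch</code> is a stand-in invented for this example: it sleeps instead of touching the network, and in a real crawler you would replace it with an aiohttp or httpx call.</p>

```python
import asyncio
from typing import List


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call (for example via aiohttp or httpx)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def crawl(urls: List[str], max_concurrency: int = 3) -> List[str]:
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # Only max_concurrency coroutines can hold the semaphore at once.
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input awaitables.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
pages = asyncio.run(crawl(urls))
print(len(pages), pages[0])
```

<p>The concurrency limit is the knob to tune: raising it speeds up the crawl but also increases load on the target site and the chance of being throttled.</p>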
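<h4>13.4 Login Simulation and CAPTCHA Handling</h4>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>requests</strong> + <strong>Session</strong></td>
<td>Simulates login and manages cookies automatically</td>
<td>Most cases where data must be scraped after logging in</td>
</tr>
<tr>
<td><strong>selenium</strong></td>
<td>Logs in through an automated browser, executes JS, and handles complex login logic</td>
<td>Logins that involve JS encryption or dynamic tokens</td>
</tr>
<tr>
<td><strong>Playwright</strong></td>
<td>Similar to Selenium, but generally faster and with a more modern interface</td>
<td>Lighter-weight browser automation</td>
</tr>
<tr>
<td><strong>pytesseract</strong></td>
<td>OCR for recognizing text in images</td>
<td>Recognizing simple CAPTCHAs</td>
</tr>
<tr>
<td><strong>captcha_solver</strong></td>
<td>SDK for third-party CAPTCHA-solving platforms</td>
<td>When you need to call a paid solving platform</td>
</tr>
<tr>
<td><strong>twoCaptcha</strong></td>
<td>Python client for a paid CAPTCHA-solving platform</td>
<td>When you need a dependable CAPTCHA-solving service</td>
</tr>
</tbody>
</table>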
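<h4>13.5 Anti-Bot Measures and Proxies</h4>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>fake-useragent</strong></td>
<td>Generates random User-Agent strings</td>
<td>Reducing the chance of being identified as a crawler</td>
</tr>
<tr>
<td><strong>scrapy-fake-useragent</strong></td>
<td>Random UA plugin for Scrapy</td>
<td>Turning on random UAs in a Scrapy project with minimal setup</td>
</tr>
<tr>
<td><strong>requests-random-user-agent</strong></td>
<td>Adds random UA support to requests</td>
<td>Easy control over requests headers</td>
</tr>
<tr>
<td><strong>scrapy-rotating-proxies</strong></td>
<td>Proxy rotation middleware for Scrapy; switches automatically among a pool of paid or free proxies</td>
<td>Avoiding single-IP bans during large-scale Scrapy crawls</td>
</tr>
<tr>
<td><strong>scrapy-proxies</strong></td>
<td>Open-source Scrapy proxy middleware that can work with free proxy pools</td>
<td>Getting proxies working quickly in entry-level Scrapy projects</td>
</tr>
<tr>
<td><strong>proxylist2</strong></td>
<td>Python package that collects proxy IPs from multiple free-proxy sites</td>
<td>Maintaining a free-proxy list automatically</td>
</tr>
<tr>
<td><strong>requests-redis-rotating-proxies</strong></td>
<td>Stores the proxy list in Redis to provide a highly available proxy pool</td>
<td>Medium and large projects that manage proxy IPs centrally</td>
</tr>
<tr>
<td><strong>scrapy-user-agents</strong></td>
<td>Scrapy plugin with a built-in list of common UAs</td>
<td>Simplifying UA list management in Scrapy</td>
</tr>
<tr>
<td><strong>cfscrape</strong></td>
<td>Bypasses Cloudflare's simple JS challenge</td>
<td>Sites that show Cloudflare's 5-second verification page</td>
</tr>
</tbody>
</table>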
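<p>The idea behind most of the libraries in this table, rotating the User-Agent and the outgoing proxy, can be sketched with the standard library alone. Everything below is illustrative: the UA strings, the local proxy addresses, and the <code>build_request_kwargs</code> helper are assumptions for this example, not part of any library listed above.</p>

```python
import itertools
import random

# Hypothetical values for illustration; in practice the UA list might come from
# fake-useragent and the proxy addresses from your own pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

# cycle() hands out proxies round-robin, forever.
proxy_cycle = itertools.cycle(PROXIES)


def build_request_kwargs():
    """Return keyword arguments in the shape requests.get() accepts."""
    proxy = next(proxy_cycle)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


for _ in range(3):
    kwargs = build_request_kwargs()
    print(kwargs["proxies"]["http"], "|", kwargs["headers"]["User-Agent"][:40])
```

<p>With requests installed, a call would look like <code>requests.get(url, **build_request_kwargs())</code>. A production pool would also need to drop proxies that fail, which is the part the dedicated middlewares take care of.</p>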
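<h4>13.6 Distributed Scheduling</h4>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>scrapy-redis</strong></td>
<td>Scrapy extension for distributed crawling; uses Redis as the shared request queue and deduplication store</td>
<td>Distributed Scrapy projects</td>
</tr>
<tr>
<td><strong>scrapy-cluster</strong></td>
<td>Distributed Scrapy crawling system based on Kafka + Redis</td>
<td>Enterprise-grade distributed environments that need to work with a message queue</td>
</tr>
<tr>
<td><strong>Frigate</strong></td>
<td>High-performance distributed crawler that combines Redis + MongoDB</td>
<td>Large-scale distributed crawling that needs to integrate with NoSQL storage</td>
</tr>
<tr>
<td><strong>PhantomJS + Splash</strong></td>
<td>Headless rendering services that can be paired with Scrapy to form a distributed rendering environment (note that PhantomJS development was suspended in 2018)</td>
<td>Rendering JS pages at scale before scraping them</td>
</tr>
</tbody>
</table>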
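<p>For scrapy-redis, the switch to a shared queue happens almost entirely in <code>settings.py</code>. The fragment below is a minimal sketch that assumes a Redis instance on localhost; the setting names follow the scrapy-redis README as far as I know, so check them against the version you install.</p>

```python
# settings.py (fragment) for a Scrapy project that uses scrapy-redis.
# The Redis address is a placeholder; point it at your own instance.

# Use the Redis-backed scheduler so that all workers share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis so deduplication works across workers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the fingerprint set when a spider closes,
# so that an interrupted crawl can be resumed.
SCHEDULER_PERSIST = True

REDIS_URL = "redis://localhost:6379/0"
```

<p>Every machine that runs the same spider with these settings pulls from, and pushes to, the same Redis queue, which is what turns a single-process Scrapy project into a distributed one.</p>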
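<h4>13.7 Other Useful Tools</h4>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>robotparser</strong></td>
<td>Python's built-in <code>urllib.robotparser</code>; parses robots.txt</td>
<td>Checking robots.txt before you crawl</td>
</tr>
<tr>
<td><strong>tldextract</strong></td>
<td>Splits a URL into domain, subdomain, and suffix</td>
<td>Grouping or counting URLs by domain</td>
</tr>
<tr>
<td><strong>url-normalize</strong></td>
<td>Normalizes URLs into a canonical form</td>
<td>Standardizing URLs for deduplication during a crawl</td>
</tr>
<tr>
<td><strong>logging</strong></td>
<td>Python standard library for log output</td>
<td>Every crawler project should keep logs</td>
</tr>
<tr>
<td><strong>fake_useragent</strong></td>
<td>Fetches UA data and generates random UAs (this is the import name of the fake-useragent package listed in 13.5)</td>
<td>Avoiding a stale UA list</td>
</tr>
<tr>
<td><strong>termcolor</strong></td>
<td>Colors terminal text so debug output is easier to read</td>
<td>Colored cues in crawler logs and while debugging</td>
</tr>
<tr>
<td><strong>psutil</strong></td>
<td>System resource monitoring; reports CPU and memory usage</td>
<td>Monitoring resource usage of long-running crawlers</td>
</tr>
<tr>
<td><strong>schedule</strong></td>
<td>Job-scheduling library that runs scripts at set intervals</td>
<td>Running crawler tasks on a schedule</td>
</tr>
<tr>
<td><strong>watchdog</strong></td>
<td>File-system monitoring; fires callbacks when files or directories change</td>
<td>Watching result files in real time to trigger follow-up tasks</td>
</tr>
</tbody>
</table>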
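<p>Checking robots.txt takes only a few lines with <code>urllib.robotparser</code>. In this sketch the rules are parsed from an inline string so the example runs offline; the rules themselves and the crawler name <code>my-crawler</code> are made up for illustration. Against a live site you would call <code>set_url()</code> and <code>read()</code> instead of <code>parse()</code>.</p>

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would load them from the site's /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-crawler"  # hypothetical crawler name

for url in ("https://example.com/articles/1", "https://example.com/admin/login"):
    verdict = "allowed" if parser.can_fetch(USER_AGENT, url) else "blocked"
    print(url, "->", verdict)

# crawl_delay() returns the Crawl-delay value that applies to this agent, if any.
print("crawl delay:", parser.crawl_delay(USER_AGENT))
```

<p>Calling <code>can_fetch()</code> before each request, and sleeping for the crawl delay between requests, covers the basic courtesy rules most sites publish.</p>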
<blockquote>
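<p><strong>Note</strong>: For reasons of space, the tables above list only the Python crawling libraries that were commonly used or relatively stable as of the end of 2024. New libraries may appear and existing ones may change, so check the official documentation or community resources as your needs require.</p>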
</blockquote>
<hr />
<h3>14. Appendix</h3>
<h4>14.1 Common Errors and Fixes</h4>
<ol>
<li>
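<p><strong><code>ModuleNotFoundError: No module named 'xxx'</code></strong></p>
<ul>
<li>Cause: the package is not installed, or it was installed globally rather than in the virtual environment.</li>
<li>Fix: confirm that the virtual environment is activated, then run <code>pip install xxx</code>.</li>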
</ul>
</li>
<li>
<p><strong><code>requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED]</code></strong></p>
<ul>
<li>
<p>Cause: the local CA certificates are missing or out of date, so the HTTPS certificate cannot be verified.</p>
</li>
<li>
<p>Fix:</p>
<ul>
<li>Upgrade <code>certifi</code>: <code>pip install --upgrade certifi</code>;</li>
<li>Skip verification temporarily: <code>requests.get(url, verify=False)</code> (not recommended in production).</li>
</ul>
</li>
</ul>
</li>
<li>
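<p><strong><code>ValueError: too many values to unpack (expected 2)</code> when unpacking XPath results</strong></p>
<ul>
<li>Cause: the code uses <code>for x, y in tree.xpath(...)</code>, but each item the XPath returns is a single node or string, not a pair of values.</li>
<li>Fix: check the XPath expression, or run two separate queries and pair the two result lists with <code>zip()</code>.</li>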
</ul>
</li>
<li>
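<p><strong><code>selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH</code></strong></p>
<ul>
<li>Cause: <code>chromedriver</code> is not on the system PATH, or the path is wrong.</li>
<li>Fix: download the <code>chromedriver</code> that matches your Chrome version and add its location to the PATH environment variable, or pass the path in code. Selenium 4 expects the path through <code>Service(executable_path=...)</code> rather than the older <code>executable_path</code> argument, and Selenium 4.6+ includes Selenium Manager, which can download a matching driver automatically.</li>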
</ul>
</li>
<li>
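<p><strong><code>pymysql.err.OperationalError: (1045, "Access denied for user 'root'@'localhost' (using password: YES)")</code></strong></p>
<ul>
<li>Cause: the MySQL username or password is wrong, or the account is not allowed to connect from this host. Error 1045 means the server was reached and rejected the credentials; a stopped MySQL service or a missing database produces different errors (2003 and 1049 respectively).</li>
<li>Fix: check the username and password, and confirm the account's privileges for the connecting host. If the error code changes, check that the MySQL service is running and that the database name exists.</li>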
</ul>
</li>
<li>
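<p><strong><code>TimeoutError</code> or <code>asyncio.exceptions.TimeoutError</code></strong></p>
<ul>
<li>Cause: the network is slow, or the target site is throttling you.</li>
<li>Fix: increase the <code>timeout</code> parameter, lower the concurrency, and use proxies where appropriate.</li>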
</ul>
</li>
<li>
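<p><strong>UnicodeEncodeError/UnicodeDecodeError</strong></p>
<ul>
<li>Cause: the encoding of the text being processed does not match the encoding Python assumes by default.</li>
<li>Fix: set <code>response.encoding = 'utf-8'</code> explicitly, or pass <code>encoding='utf-8'</code> when reading and writing files.</li>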
</ul>
</li>
</ol>
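<p>The <code>zip()</code> fix for the XPath unpacking error is easier to see in code. The sketch below uses the standard library's <code>xml.etree.ElementTree</code> as a stand-in for lxml so that it runs without extra packages; the markup and class names are invented for the example, and with lxml the same pattern applies to the lists returned by <code>tree.xpath(...)</code>.</p>

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet; ElementTree stands in for lxml here so the
# example runs without third-party packages.
html = """
<ul>
  <li><span class="title">Book A</span><span class="price">10</span></li>
  <li><span class="title">Book B</span><span class="price">12</span></li>
</ul>
"""
root = ET.fromstring(html)

# Each query returns a flat list of single values, not (title, price) pairs.
titles = [el.text for el in root.findall(".//span[@class='title']")]
prices = [el.text for el in root.findall(".//span[@class='price']")]

# Wrong: `for title, price in titles:` tries to unpack each string into two
# names and raises ValueError unless the string happens to have two characters.
# Right: query the two columns separately and pair them up with zip().
for title, price in zip(titles, prices):
    print(title, price)
```

<p>Note that <code>zip()</code> stops at the shorter list, so if some items lack a price the pairs will silently misalign. When that can happen, iterate over each <code>&lt;li&gt;</code> node and query the title and price relative to it instead.</p>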
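<h4>14.2 Quick Reference: Common HTTP Status Codes</h4>
<table>
<thead>
<tr>
<th>Status code</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>200</td>
<td>OK: the request succeeded</td>
</tr>
<tr>
<td>301</td>
<td>Moved Permanently: permanent redirect</td>
</tr>
<tr>
<td>302</td>
<td>Found: temporary redirect</td>
</tr>
<tr>
<td>400</td>
<td>Bad Request: the request message is malformed</td>
</tr>
<tr>
<td>401</td>
<td>Unauthorized: authentication is required</td>
</tr>
<tr>
<td>403</td>
<td>Forbidden: the server refuses access (a common response when a crawler is blocked)</td>
</tr>
<tr>
<td>404</td>
<td>Not Found: the resource does not exist</td>
</tr>
<tr>
<td>405</td>
<td>Method Not Allowed: the request method is not permitted for this resource</td>
</tr>
<tr>
<td>408</td>
<td>Request Timeout: the server timed out waiting for the client to send the request</td>
</tr>
<tr>
<td>429</td>
<td>Too Many Requests: the client is sending requests too frequently</td>
</tr>
<tr>
<td>500</td>
<td>Internal Server Error: the server hit an internal error</td>
</tr>
<tr>
<td>502</td>
<td>Bad Gateway: a server acting as a gateway or proxy received an invalid response from the upstream server</td>
</tr>
<tr>
<td>503</td>
<td>Service Unavailable: the server temporarily cannot handle the request, often because of heavy traffic or rate limiting</td>
</tr>
</tbody>
</table>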
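<p>In a crawler, these codes usually map to a small set of actions. The sketch below shows one possible policy; the grouping into retry, skip, and so on is an assumption for illustration, and the <code>decide</code> function is not part of any library.</p>

```python
from http import HTTPStatus

# Assumed policy for illustration; tune it to the target site.
RETRYABLE = {408, 429, 500, 502, 503}


def decide(status_code):
    """Map a response status code to a crawler action."""
    if 200 <= status_code < 300:
        return "parse"
    if 300 <= status_code < 400:
        return "follow-redirect"  # requests follows redirects by default for GET
    if status_code in RETRYABLE:
        return "retry"            # back off first, then try again
    if status_code in (401, 403):
        return "check-access"     # log in, or review headers and request rate
    return "skip"                 # e.g. 400, 404, 405: retrying will not help


for code in (200, 302, 403, 404, 429, 503):
    print(code, HTTPStatus(code).phrase, "->", decide(code))
```

<p>Retries for 429 and 503 work best with an increasing delay between attempts, since both codes usually mean the site wants you to slow down.</p>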
<h4>14.3 Learning Resources and Next Steps</h4>
<ol>
<li>
<p><strong>Official documentation</strong></p>
<ul>
<li>Requests: https://docs.python-requests.org/</li>
<li>BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/</li>
<li>Scrapy: https://docs.scrapy.org/</li>
<li>Selenium: https://www.selenium.dev/documentation/</li>
<li>Playwright: https://playwright.dev/python/</li>
<li>aiohttp: https://docs.aiohttp.org/</li>
<li>httpx: https://www.python-httpx.org/</li>
</ul>
</li>
<li>
<p><strong>Recommended books</strong></p>
<ul>
<li>《Python网络数据采集(第二版)》 (Web Scraping with Python, 2nd Edition), by Ryan Mitchell</li>
<li>《深入Python爬虫框架 Scrapy》, by 黄今</li>
<li>《Python3网络爬虫开发实战》, by 崔庆才 (Cui Qingcai)</li>
</ul>
</li>
<li>
<p><strong>Courses and videos</strong></p>
<ul>
<li>Bilibili and YouTube both have good Python crawler video tutorials (try searching for “Python 爬虫 零基础” or “Scrapy 教程”).</li>
<li>Advanced Python crawling courses on Coursera and 慕课网 (imooc).</li>
</ul>
</li>
<li>
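<p><strong>Community resources</strong></p>
<ul>
<li>Stack Overflow: https://stackoverflow.com/ (search here when you hit an error)</li>
<li>SegmentFault: https://segmentfault.com/ (a developer community based in China)</li>
<li>GitHub Trending: search for open-source crawler projects and learn from their practices.</li>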
</ul>
</li>
</ol>
<hr />
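<h3>15. Summary</h3>
<p>This tutorial has moved from the basics of <code>requests + BeautifulSoup</code> through the Scrapy framework, browser automation, asynchronous crawling, and distributed crawling. Along the way it has laid out the common techniques and practical points of Python crawling and taken stock of the mainstream libraries and tools as of the end of 2024. For beginners, the following key points are enough to get started quickly:</p>
<ol>
<li><strong>Understand HTTP basics</strong>: be able to construct GET/POST requests and analyze responses;</li>
<li><strong>Master HTML parsing</strong>: get familiar with BeautifulSoup and lxml (XPath/CSS selectors);</li>
<li><strong>Try Scrapy</strong>: learn to set up a Scrapy project, write Spiders, Pipelines, and Settings, and debug with Scrapy Shell;</li>
<li><strong>Handle dynamic pages</strong>: use Selenium or Playwright to fetch JS-rendered content, then extract the data with the usual parsing methods;</li>
<li><strong>Explore asynchronous crawling</strong>: understand how coroutines work, and use aiohttp or httpx to raise concurrency;</li>
<li><strong>Store and deduplicate data</strong>: know how to use CSV, JSON, SQLite, MySQL, and MongoDB, and deduplicate URLs (sets, Redis, Bloom filters);</li>
<li><strong>Deal with anti-bot measures</strong>: set the User-Agent and Referer, add download delays, use a proxy IP pool, and understand the general approaches to CAPTCHA handling;</li>
<li><strong>Go distributed</strong>: learn Scrapy-Redis to spread tasks across multiple machines and raise throughput.</li>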
</ol>
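<p>As a concrete instance of the URL deduplication mentioned in point 6, here is a minimal in-memory sketch based on hashed fingerprints. The normalization rules and the <code>SeenUrls</code> class are assumptions made for this example; at larger scale the Python set would typically be replaced by a Redis set or a Bloom filter.</p>

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def fingerprint(url):
    """Return a SHA-1 fingerprint of a lightly normalized URL."""
    parts = urlsplit(url.strip())
    # Lowercase the scheme and host, and drop the fragment,
    # which is never sent to the server.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenUrls:
    """In-memory de-duplication keyed on URL fingerprints."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Record url; return True if it is new, False if already seen."""
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenUrls()
candidates = [
    "https://Example.com/list?page=1",
    "https://example.com/list?page=1#top",  # same page, different case and fragment
    "https://example.com/list?page=2",
]
fresh = [u for u in candidates if seen.add(u)]
print(fresh)
```

<p>Storing fixed-length fingerprints rather than raw URLs keeps memory use predictable, and the same <code>add()</code> interface can later be backed by Redis without changing the calling code.</p>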
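<p>Finally, crawling technology changes quickly. The mainstream libraries described here reflect the state of things when this tutorial was written (end of 2024) and may shift as tools evolve and sites upgrade their anti-bot defenses. Once you have the basics, keep an eye on the major Python communities, GitHub Trending, and the official documentation so you can pick up new features, libraries, and ideas and keep improving your own crawlers. May you go far on the road of data scraping and enjoy exploring the world of Python crawlers!</p>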
<hr />
<p><em>Written: June 1, 2025</em></p>
</div>
				
				
             </div>
		</div>
    

			
    
	</div>
</div>
</div>

 </body></html>