Automated Scraping of Tuniu's Dynamic Web Pages with Python + Selenium

Technical documentation

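1. Introduction

In web data collection, scraping dynamic pages (pages that load their data asynchronously through JavaScript) has long been a challenge. The traditional **requests** + **BeautifulSoup** combination works well for static pages, but it cannot directly retrieve dynamically rendered content such as the hotels, attractions, and reviews on the Tuniu travel site.

Selenium is a powerful browser automation tool that can simulate user actions (clicking, scrolling, typing, and so on) and return the complete HTML after dynamic rendering. This article explains how to use Python + Selenium to automate the scraping of dynamic data from Tuniu, and provides a complete code implementation.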

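2. Environment Setup

Before starting, install the required Python libraries (`selenium`, `beautifulsoup4`, and `pandas`, the packages imported by the examples in this article), for example with `pip install selenium beautifulsoup4 pandas`.

Selenium also needs a browser driver such as ChromeDriver. Make sure the Chrome browser is installed and download the ChromeDriver build that matches your Chrome version. Selenium 4.6 and later can also locate a matching driver automatically through Selenium Manager.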

3. Selenium Basics

3.1 Initializing the browser driver

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import time

    # Path to the ChromeDriver executable
    driver_path = "your/path/to/chromedriver"  # e.g. /usr/local/bin/chromedriver
    service = Service(driver_path)

    # Start the browser (headless mode is optional)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # headless mode: no browser window is shown
    driver = webdriver.Chrome(service=service, options=options)

3.2 Visiting a page and waiting for it to load

    url = "https://www.tuniu.com/"
    driver.get(url)
    time.sleep(3)  # wait for the page to load
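A fixed `time.sleep` either wastes time or is too short on a slow connection. The sketch below shows the idea behind an explicit wait: poll a condition until it holds or a timeout expires. The helper name `wait_until` and its parameters are hypothetical and are not part of Selenium; Selenium ships its own `WebDriverWait` class that serves the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a live driver this could be called as `wait_until(lambda: driver.find_elements(By.ID, "search-input"))`, since `find_elements` returns an empty (falsy) list until the element appears. The element ID is the one used later in this article and may need adjusting to the real page.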

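3.3 Finding elements and interacting with them

Selenium provides several ways to locate elements:

**find_element(By.ID, "id")**
**find_element(By.CLASS_NAME, "class")**
**find_element(By.XPATH, "xpath")**

For example, to search for tour routes to Beijing ("北京"):

    search_box = driver.find_element(By.ID, "search-input")
    search_box.send_keys("北京")  # the search keyword, "Beijing"
    search_box.send_keys(Keys.RETURN)  # simulate pressing Enter
    time.sleep(5)  # wait for the search results to load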

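4. Hands-on: Scraping Tuniu Travel Data

4.1 Target analysis

Suppose we want to scrape Tuniu's popular tour routes, including:

Route name
Price
Departure city
Trip length in days
User rating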
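The five target fields can be modelled as a small record type before any scraping code is written. The following is a minimal sketch: the `Tour` class, the helper names, and the assumption that prices, durations, and ratings arrive as strings with an embedded number (for instance "¥1,299起" or "5天4晚") are all illustrative and not taken from the Tuniu site.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tour:
    name: str
    price: Optional[float]   # numeric part of the price text
    departure: str
    days: Optional[int]      # trip length in days
    rating: Optional[float]


def first_number(text):
    """Return the first number found in `text` as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def make_tour(name, price, departure, days, rating):
    """Build a Tour from the raw strings scraped off a listing."""
    day_count = first_number(days)
    return Tour(
        name=name.strip(),
        price=first_number(price),
        departure=departure.strip(),
        days=int(day_count) if day_count is not None else None,
        rating=first_number(rating),
    )
```

Keeping the cleaning step separate from the scraping step means a layout change on the site only affects the selectors, not the data model.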

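4.2 Getting the dynamically rendered HTML

Because Tuniu loads its data dynamically, a plain **requests.get()** call cannot retrieve the complete HTML. Instead, load the page in Selenium and read the rendered markup from `driver.page_source` (the later examples store it in a variable named `html`).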
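Some dynamic pages only append more results as the user scrolls. Assuming that applies to the listing being scraped (this is an assumption, not something verified against Tuniu), one way to obtain the fully rendered HTML is to scroll until the page height stops growing and then read `page_source`. The function name `rendered_html` and its parameters are hypothetical; `get`, `execute_script`, and `page_source` are standard WebDriver members.

```python
import time


def rendered_html(driver, url, pause=1.0, max_scrolls=10):
    """Load `url`, scroll until the page height stops growing, return the HTML."""
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to append content
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break  # nothing new was loaded, so the page is complete
        last_height = height
    return driver.page_source
```

A call such as `html = rendered_html(driver, "https://www.tuniu.com/")` would then feed the parsing step. The `max_scrolls` cap keeps an endlessly growing page from looping forever.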

4.3 Parsing the data (BeautifulSoup)

    from bs4 import BeautifulSoup
    import pandas as pd

    html = driver.page_source  # rendered HTML from section 4.2
    soup = BeautifulSoup(html, 'html.parser')
    tours = []

    for item in soup.select('.trip-item'):  # adjust the selectors to the actual HTML structure
        name = item.select_one('.title').text.strip()
        price = item.select_one('.price').text.strip()
        departure = item.select_one('.departure').text.strip()
        days = item.select_one('.days').text.strip()
        rating = item.select_one('.rating').text.strip()
        tours.append({
            'name': name,
            'price': price,
            'departure': departure,
            'days': days,
            'rating': rating
        })

    # Store the results in a DataFrame
    df = pd.DataFrame(tours)
    print(df.head())
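In the parsing loop, `select_one` returns `None` when a listing lacks one of the fields, and calling `.text` on it then raises `AttributeError`. A minimal sketch of a more tolerant extraction follows. The helper names and the `FIELDS` mapping are hypothetical; the CSS selectors are the article's placeholder selectors, and `select_one` and `get_text(strip=True)` are the usual BeautifulSoup calls.

```python
# Field name -> CSS selector, mirroring the placeholder selectors used above
FIELDS = {
    "name": ".title",
    "price": ".price",
    "departure": ".departure",
    "days": ".days",
    "rating": ".rating",
}


def text_or_default(node, selector, default=""):
    """Return the stripped text of the first match for `selector`, or `default`."""
    element = node.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


def parse_item(item):
    """Turn one listing node into a dict with every field present."""
    return {field: text_or_default(item, css) for field, css in FIELDS.items()}
```

The loop body then shrinks to `tours.append(parse_item(item))`, and a listing without a rating yields an empty string rather than stopping the whole run.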

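4.4 Scraping multiple pages

Tuniu's listings are usually paginated, so we can simulate clicking "Next page":

    from selenium.common.exceptions import NoSuchElementException

    while True:
        try:
            next_page = driver.find_element(By.CSS_SELECTOR, '.next-page')
            next_page.click()
            time.sleep(3)  # wait for the new page to load
            html = driver.page_source
            # continue parsing...
        except NoSuchElementException:
            break  # no "next page" button: stop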
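The pagination loop can also be written as a generator that yields one page of HTML at a time and enforces an upper bound, which keeps navigation separate from parsing. This is a sketch under assumptions: `iter_pages` and `click_next_page` are hypothetical names, and `.next-page` is the article's placeholder selector.

```python
import time


def iter_pages(driver, click_next, max_pages=5, pause=0.0):
    """Yield the HTML of each result page until there is no next page."""
    for page_number in range(1, max_pages + 1):
        yield driver.page_source
        if page_number == max_pages or not click_next(driver):
            break
        time.sleep(pause)  # let the next page render before reading it


def click_next_page(driver, selector=".next-page"):
    """Click the "next page" control; return False when it is absent."""
    # Imported here so iter_pages stays usable without Selenium installed
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        driver.find_element(By.CSS_SELECTOR, selector).click()
    except NoSuchElementException:
        return False
    return True
```

Typical use would be `for html in iter_pages(driver, click_next_page, max_pages=3, pause=3): ...`, with the parsing code inside the loop body.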

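5. Dealing with Anti-Scraping Measures

Tuniu may detect Selenium-driven crawlers. Common countermeasures include:

Changing the User-Agent
Disabling the automation flags
Using proxy IPs
Waiting a random amount of time between requests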
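The four measures above can be sketched as a helper that builds Chrome command-line switches plus a randomized pause. The function names and defaults are hypothetical. The switches themselves (`--user-agent`, `--disable-blink-features=AutomationControlled`, `--proxy-server`) are existing Chrome flags, though whether they are enough to avoid detection on Tuniu is not something this sketch can promise.

```python
import random
import time


def chrome_arguments(user_agent=None, proxy=None, hide_automation=True):
    """Build a list of Chrome command-line switches for the measures above."""
    arguments = []
    if user_agent:
        arguments.append(f"--user-agent={user_agent}")
    if hide_automation:
        # Asks Blink not to expose its automation-controlled feature to pages
        arguments.append("--disable-blink-features=AutomationControlled")
    if proxy:
        # Expected form: scheme://host:port (no embedded credentials)
        arguments.append(f"--proxy-server={proxy}")
    return arguments


def random_pause(low=2.0, high=5.0):
    """Sleep for a random time between `low` and `high` seconds; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Each switch is then applied with `options.add_argument(arg)` before the driver is created. Many scripts additionally call `options.add_experimental_option("excludeSwitches", ["enable-automation"])` to drop the automation switch that ChromeDriver adds by default.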

6. Complete Code Example

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    import random

    # Proxy configuration
    proxyHost = "www.16yun.cn"
    proxyPort = "5445"
    proxyUser = "16QMSOML"
    proxyPass = "280651"

    # Initialize the browser
    driver_path = "your/path/to/chromedriver"
    service = Service(driver_path)
    options = webdriver.ChromeOptions()

    # Set the proxy. Chrome's --proxy-server switch does not accept embedded
    # credentials, so proxyUser/proxyPass have to be supplied another way
    # (for example IP allow-listing at the provider or a proxy-auth extension).
    proxy_options = f"--proxy-server=http://{proxyHost}:{proxyPort}"
    options.add_argument(proxy_options)

    # Other options
    options.add_argument('--headless')  # headless mode
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
    options.add_argument('--ignore-certificate-errors')

    driver = webdriver.Chrome(service=service, options=options)

    # Open the Tuniu travel site
    url = "https://www.tuniu.com/"
    driver.get(url)
    time.sleep(3)

    # Search for "北京" (Beijing) tour routes
    search_box = driver.find_element(By.ID, "search-input")
    search_box.send_keys("北京")
    search_box.send_keys(Keys.RETURN)
    time.sleep(5)

    # Scrape several pages of results
    tours = []
    for _ in range(3):  # scrape 3 pages
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        for item in soup.select('.trip-item'):
            name = item.select_one('.title').text.strip()
            price = item.select_one('.price').text.strip()
            departure = item.select_one('.departure').text.strip()
            days = item.select_one('.days').text.strip()
            rating = item.select_one('.rating').text.strip()
            tours.append({
                'name': name,
                'price': price,
                'departure': departure,
                'days': days,
                'rating': rating
            })

        # Go to the next page
        try:
            next_page = driver.find_element(By.CSS_SELECTOR, '.next-page')
            next_page.click()
            time.sleep(random.uniform(2, 5))
        except NoSuchElementException:
            break

    # Save the data
    df = pd.DataFrame(tours)
    df.to_csv('tuniu_tours.csv', index=False, encoding='utf-8-sig')

    # Close the browser
    driver.quit()
    print("Scraping finished; data saved to tuniu_tours.csv")

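7. Summary

This article showed how to use Python + Selenium to automate the scraping of dynamic data from the Tuniu travel site, covering:

Selenium basics (starting the browser, finding elements, simulating clicks)
Dynamic page parsing (extracting data with BeautifulSoup)
Multi-page scraping (automatically clicking "Next page")
Anti-scraping strategies (User-Agent, proxy IPs, random waits)

Selenium is powerful but relatively slow, so it is best suited to small-scale scraping. For higher throughput, consider Playwright or Scrapy + Splash.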
