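Breaking Through Anti-Scraping Barriers: Dynamically Scraping Taobao Product Reviews with Selenium (Source Code Included)

Technical Documentation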

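I. Analysis of Taobao's Anti-Scraping Mechanisms

Taobao uses a multi-layered anti-scraping strategy, which mainly includes:

Traffic pattern detection: identifies abnormal request frequency
WebDriver detection: detects the fingerprints of automated browsers
Behavior pattern analysis: checks mouse movement trajectories
Dynamic parameter encryption: key requests carry encrypted tokens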
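To make the first item concrete, here is a minimal sketch of how request-frequency detection could work in principle. Taobao's real detection logic is not public, so the `SlidingWindowDetector` class, its parameters, and the thresholds below are all hypothetical and serve only as an illustration.

```python
import time
from collections import deque


class SlidingWindowDetector:
    """Flags a client whose request count inside a time window exceeds a limit."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now=None):
        """Record one request; return True if the client now looks abnormal."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Drop requests that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests


detector = SlidingWindowDetector(max_requests=3, window_seconds=1.0)
# Five requests 0.1 s apart: the fourth and fifth exceed the limit of three.
flags = [detector.record(now=i * 0.1) for i in range(5)]
print(flags)  # [False, False, False, True, True]
```

A scraper that fires requests at a fixed, fast pace trips this kind of check quickly, which is why the rest of the article spends so much effort on pacing and random delays.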

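II. Technical Approach to Bypassing Anti-Scraping

2.1 Core Tool Selection

# Install the required libraries (openpyxl is needed by pandas to write .xlsx files)
pip install selenium undetected-chromedriver fake_useragent pandas openpyxl

| Tool | Purpose | Version requirement |
| --- | --- | --- |
| undetected-chromedriver | Hides automation fingerprints | ≥3.1.6 |
| selenium | Browser automation control | ≥4.0 |
| fake_useragent | Generates random request headers | ≥1.1.1 |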

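2.2 What Each Library Does

| Library / module | Purpose |
| --- | --- |
| selenium | Automates the browser and simulates real user actions |
| fake_useragent | Generates random browser User-Agent strings to mimic different devices |
| WebDriverWait | Waits for page elements to load, avoiding element-not-found errors caused by network latency |
| pandas | Data processing; saves the final results to Excel |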

III. Anti-Scraping Logic

3.1 Core Class Walkthrough - `TaobaoCommentCrawler`

3.1.1 Initializer `__init__`

    def __init__(self):
        self.ua = UserAgent()  # create the random User-Agent generator
        self.options = self._get_browser_options()  # build the browser options
        # ChromeDriver path (replace with your own path)
        chromedriver_path = r'D:\soft\chromedriver\chromedriver-win64\chromedriver.exe'
        self.service = Service(executable_path=chromedriver_path)  # create the Service instance
        self.driver = webdriver.Chrome(service=self.service, options=self.options)  # pass in the service
        self.wait = WebDriverWait(self.driver, 45)

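Key points:

UserAgent() creates a generator whose .random property returns a random User-Agent string such as Mozilla/5.0 (Windows NT 10.0; Win64; x64)...
webdriver.Chrome launches a real browser instance (the Chrome browser must be installed)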
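Since the initializer hardcodes a Windows path, one possible way to make it portable is to look the driver up at runtime. The sketch below is only an illustration: the helper name `resolve_chromedriver_path` and the `CHROMEDRIVER_PATH` environment variable are assumptions, not part of the original code.

```python
import os
import shutil


def resolve_chromedriver_path(env_var="CHROMEDRIVER_PATH"):
    """Return a chromedriver path from the environment or PATH, or None."""
    # 1. An explicit override through an environment variable (hypothetical name).
    configured = os.environ.get(env_var)
    if configured:
        return configured
    # 2. Otherwise look for a chromedriver executable on the system PATH.
    return shutil.which("chromedriver")


path = resolve_chromedriver_path()
print(path or "chromedriver not found; set CHROMEDRIVER_PATH")
```

The returned value could then be passed as `Service(executable_path=path)`. As far as I know, Selenium 4.6 and later can also locate a matching driver on their own when no path is given, so the explicit path may be unnecessary on recent versions.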

3.1.2 Browser Configuration `_get_browser_options`

    def _get_browser_options(self):
        """Configure anti-detection browser options."""
        options = webdriver.ChromeOptions()
        # Key anti-detection settings
        options.add_argument(f'user-agent={self.ua.random}')  # random UA
        options.add_argument('--disable-blink-features=AutomationControlled')  # hide automation fingerprints
        options.add_experimental_option("excludeSwitches", ["enable-automation"])  # remove the automation infobar
        options.add_experimental_option('useAutomationExtension', False)  # disable the automation extension
        return options

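How it counters detection:

| Option | Effect |
| --- | --- |
| user-agent | Makes the server believe requests come from different browsers/devices |
| disable-blink-features | Hides browser automation fingerprints (such as the navigator.webdriver property) |
| excludeSwitches | Removes Chrome's automation infobar (e.g. "Chrome is being controlled by automated test software") |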
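A quick way to check whether these options took effect is to read `navigator.webdriver` from inside the page. The helper below is a sketch, not part of the original crawler; `webdriver_flag_hidden` and `FakeDriver` are hypothetical names, and the fake driver exists only so the example runs without launching a browser.

```python
def webdriver_flag_hidden(driver):
    """Return True if the page does not see navigator.webdriver as true."""
    # execute_script returns the JavaScript value converted to Python:
    # True when the flag is exposed, False or None when it is hidden.
    flag = driver.execute_script("return navigator.webdriver")
    return not flag


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, flag):
        self._flag = flag

    def execute_script(self, script):
        return self._flag


print(webdriver_flag_hidden(FakeDriver(True)))  # False (flag exposed)
print(webdriver_flag_hidden(FakeDriver(None)))  # True (flag hidden)
```

With a real session, the same function could be called as `webdriver_flag_hidden(self.driver)` right after the browser starts.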

3.1.3 Human Behavior Simulation `_human_like_operation`

    def _human_like_operation(self):
        """Simulate human browsing behavior."""
        # Scroll the page by a random amount
        scroll_height = random.randint(500, 1000)
        self.driver.execute_script(f"window.scrollTo(0, {scroll_height})")
        # Pause for a random interval
        time.sleep(random.uniform(1.5, 3.5))

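Why this helps:

Scrolling the page: triggers lazy-loaded content and produces a trace that resembles real user behavior
Random delays: break the fixed-interval timing typical of automated access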
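The method above jumps to one position in a single call. A possible refinement is to split the scroll into several uneven steps, each with its own short pause. This is only a sketch: `plan_scroll_steps`, its step sizes, and its pause ranges are assumptions chosen for illustration, not values from the original code.

```python
import random


def plan_scroll_steps(target_height, rng, min_step=120, max_step=400):
    """Split one scroll into several uneven steps, each with its own pause.

    Returns a list of (scroll_y, pause_seconds) tuples ending at target_height.
    """
    steps = []
    position = 0
    while position < target_height:
        position = min(target_height, position + rng.randint(min_step, max_step))
        steps.append((position, round(rng.uniform(0.2, 0.8), 2)))
    return steps


rng = random.Random(42)  # seeded only to make the demo repeatable
for scroll_y, pause in plan_scroll_steps(1000, rng):
    print(f"scrollTo(0, {scroll_y}) then sleep {pause}s")
```

Inside the crawler, each tuple could drive `self.driver.execute_script(f"window.scrollTo(0, {scroll_y})")` followed by `time.sleep(pause)`.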

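3.2 Core Scraping Logic

Key techniques:

3.2.1 Element Location and Wait Strategy

(1) Dual-locator fallback:

    EC.any_of(
        EC.element_to_be_clickable((By.XPATH, view_all_xpath)),     # "查看全部" ("View all")
        EC.element_to_be_clickable((By.XPATH, reviews_tab_xpath)),  # "宝贝评价" ("Product reviews")
    )

Here view_all_xpath and reviews_tab_xpath stand for the two XPath expressions given in the complete code.

Value: EC.any_of watches several candidate elements at once, which covers different page versions (some products show "查看全部", others show "宝贝评价")
How it works: it polls the locators in turn and returns as soon as any one of the elements becomes clickable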
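The "first match wins" idea behind `EC.any_of` can be sketched in a few lines of plain Python. This is a simplified stand-in and not Selenium's actual source; `any_of_sketch`, the two condition functions, and the dict used in place of a driver are all hypothetical.

```python
def any_of_sketch(*conditions):
    """Combine conditions: the result is the first truthy value any of them returns."""

    def check(driver):
        for condition in conditions:
            try:
                result = condition(driver)
            except Exception:
                # A locator that fails (element absent) just means "not yet".
                continue
            if result:
                return result
        return False

    return check


def view_all_link(page):
    return page["view_all"]  # raises KeyError if the page has no such link


def reviews_tab(page):
    return page.get("reviews_tab")


check = any_of_sketch(view_all_link, reviews_tab)
print(check({"reviews_tab": "<span>宝贝评价</span>"}))  # <span>宝贝评价</span>
print(check({}))  # False
```

Because the combined check returns `False` until something matches, a wait loop can keep calling it until one page variant shows up.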

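(2) Explicit wait with timeout:

    WebDriverWait(self.driver, 45).until(...)

Value: prevents element-location failures caused by network latency or slow page loads
Parameter: a 45-second timeout, to accommodate environments of varying performance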
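What an explicit wait does can be approximated by a small polling loop. The sketch below is an illustration in plain Python rather than Selenium's implementation; `wait_until` and `element_ready` are hypothetical names. As far as I know, `WebDriverWait` polls every 0.5 seconds by default and raises Selenium's own `TimeoutException`, whereas this sketch raises the built-in `TimeoutError`.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


attempts = {"count": 0}


def element_ready():
    # Pretend the element appears on the third poll.
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2, poll_interval=0.01))  # element
print(attempts["count"])  # 3
```

The wait ends as soon as the condition succeeds, so a generous timeout such as 45 seconds only costs time when the element never appears.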

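3.2.2 Handling Dynamic Content and Preventing Stale Elements

(1) Keeping elements interactable:

    self.driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)

Value: uses JavaScript to scroll the browser directly, ensuring the element is inside the viewport
Parameter tuning: {block: 'center'} scrolls the element to the middle of the viewport so that it is not covered by floating overlays

(2) Re-fetching dynamic elements:

    items = self.driver.find_elements(...)
    for item in items:
        ...  # re-fetch the element before each operation

Pain point solved: avoids the StaleElementReferenceException raised when the page refreshes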
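The re-fetch idea can be wrapped in a small retry helper. The following is a sketch under stated assumptions: `act_with_refetch` is a hypothetical helper, and `StaleReference` is a stand-in defined here so the example runs without Selenium. In real code, `StaleElementReferenceException` from `selenium.common.exceptions` would be passed as `stale_errors`.

```python
class StaleReference(Exception):
    """Stand-in for selenium's StaleElementReferenceException."""


def act_with_refetch(find, action, retries=3, stale_errors=(StaleReference,)):
    """Look the element up again before every attempt, retrying if it went stale."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        element = find()  # fresh lookup, never a cached reference
        try:
            return action(element)
        except stale_errors as error:
            last_error = error
    raise last_error


lookups = []


def find():
    lookups.append(len(lookups) + 1)
    return lookups[-1]


def read_text(element):
    # The first reference is "stale"; the re-fetched one works.
    if element == 1:
        raise StaleReference("element is no longer attached to the DOM")
    return f"text from lookup #{element}"


print(act_with_refetch(find, read_text))  # text from lookup #2
```

The key design choice is that `find` is a callable, so every retry performs a new lookup instead of reusing a reference that the page may already have invalidated.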

IV. Complete Code

    import time
    import random
    import pandas as pd
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from fake_useragent import UserAgent


    class TaobaoCommentCrawler:
        def __init__(self):
            self.ua = UserAgent()  # create the random User-Agent generator
            self.options = self._get_browser_options()  # build the browser options
            # ChromeDriver path (replace with your own path)
            chromedriver_path = r'D:\soft\chromedriver\chromedriver-win64\chromedriver.exe'
            self.service = Service(executable_path=chromedriver_path)  # create the Service instance
            self.driver = webdriver.Chrome(service=self.service, options=self.options)  # pass in the service
            self.wait = WebDriverWait(self.driver, 45)

        def _get_browser_options(self):
            """Configure anti-detection browser options."""
            options = webdriver.ChromeOptions()
            # Key anti-detection settings
            options.add_argument(f'user-agent={self.ua.random}')  # random UA
            options.add_argument('--disable-blink-features=AutomationControlled')  # hide automation fingerprints
            options.add_experimental_option("excludeSwitches", ["enable-automation"])  # remove the automation infobar
            options.add_experimental_option('useAutomationExtension', False)  # disable the automation extension
            return options

        def _human_like_operation(self):
            """Simulate human browsing behavior."""
            # Scroll the page by a random amount
            scroll_height = random.randint(500, 1000)
            self.driver.execute_script(f"window.scrollTo(0, {scroll_height})")
            # Pause for a random interval
            time.sleep(random.uniform(1.5, 3.5))

        def get_comments(self, item_id, max_pages=3):
            """Fetch product reviews from the new Taobao page layout."""
            comments = []
            url = f'https://item.taobao.com/item.htm?id={item_id}'
            print(f"Visiting product page: {url}")
            self.driver.get(url)
            try:
                # Click whichever review entry point the page offers
                try:
                    # Wait for either of the two possible elements (up to 45 seconds)
                    element = WebDriverWait(self.driver, 45).until(
                        EC.any_of(
                            EC.element_to_be_clickable(
                                (By.XPATH, "//div[contains(@style, 'color: rgb(255, 80, 0)')]//span[contains(text(), '查看全部')]")
                            ),
                            EC.element_to_be_clickable(
                                (By.XPATH, "//div[contains(@class, 'Tabs--title')]//span[contains(text(), '宝贝评价')]")
                            )
                        )
                    )
                    # Scroll the element into view
                    self.driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
                    time.sleep(1)
                    # Click it
                    element.click()
                    print(f"Clicked: {element.text}")
                except TimeoutException:
                    raise Exception("No clickable review tab found; check the page structure")
                # Wait for the review container to load
                self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '.Comments--comments--1662-Lt'))
                )
                print("Review container loaded")
                # Expand the reviews automatically
                self.driver.execute_script("""
                    document.querySelector('a[href*="#feedback"]').click();
                """)
                time.sleep(2)
                # Pagination
                current_page = 1
                # NOTE: the published listing is truncated from the loop condition up to
                # the fragment "> 1 else None". "<= max_pages" is inferred from the method
                # signature; the locator for `items` and the first fields of `comment`
                # did not survive and must be restored from the page's current markup.
                while current_page <= max_pages:
                    items = self.driver.find_elements(...)  # locator lost in the published listing
                    for item in items:
                        try:
                            comment = {
                                # [fields lost in the published listing; the last one ended with "> 1 else None"]
                                'useful': item.find_element(By.CSS_SELECTOR, '.Comment--like--1swbsLo span').text,
                                'visited': item.find_element(By.CSS_SELECTOR, '.Comment--visited--2t0QSw-').text
                            }
                            comments.append(comment)
                        except Exception as e:
                            print(f"Failed to parse review: {str(e)}")
                            continue
                    # Go to the next page
                    try:
                        next_btn = self.driver.find_element(By.CSS_SELECTOR, '.next-btn:not(.disabled)')
                        if next_btn:
                            self.driver.execute_script("arguments[0].scrollIntoView();", next_btn)
                            next_btn.click()
                            print("Clicked next page")
                            current_page += 1
                            time.sleep(3)
                        else:
                            print("No more pages")
                            break
                    except Exception as e:
                        print(f"Failed to turn the page: {str(e)}")
                        break
                    # Simulate human behavior
                    self._human_like_operation()
            except Exception as e:
                print(f"Scraping interrupted: {str(e)}")
            finally:
                self.driver.quit()
            return pd.DataFrame(comments)


    # Usage example
    if __name__ == '__main__':
        crawler = TaobaoCommentCrawler()
        df = crawler.get_comments('833444005595', max_pages=2)  # replace with a real product ID
        df.to_excel('taobao_comments.xlsx', index=False)

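V. Optimization Suggestions and Further Learning

Directions for performance optimization:

Use a residential proxy IP pool (Luminati, now Bright Data, is recommended)
Integrate a deep-learning CAPTCHA recognition model
Adopt a distributed architecture to increase throughput

Notes:

Use this code only in compliance with Taobao's Terms of Service
For commercial scenarios, use the official API (OpenAPI)
Control the request rate (≤30 requests per minute is suggested)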
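The rate limit in the last note can be enforced with a small throttle that spaces out page loads. This is a minimal sketch, not part of the original crawler: the `Throttle` class and its `jitter`, `clock`, and `sleep` parameters are assumptions, and the demo injects a fake clock so that nothing actually sleeps.

```python
import random
import time


class Throttle:
    """Spaces out calls so that at most max_per_minute happen per minute."""

    def __init__(self, max_per_minute=30, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the minimum interval (plus random jitter) has passed."""
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = self.min_interval - elapsed + random.uniform(0, self.jitter)
            if delay > 0:
                self._sleep(delay)
        self._last_call = self._clock()


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_per_minute=30, jitter=0.0, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()
print(slept)  # [2.0, 2.0]
```

In the crawler, calling `throttle.wait()` before each `self.driver.get(...)` or page turn would keep the pace at or below the suggested limit, since the jitter only ever adds delay.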

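Further learning:

For the fundamentals of Python web scraping, see my earlier technical notes:

Daily Practice: Scraping Weibo Hot Search Data with Python and Saving It to Excel (CSDN blog)
Daily Practice: Page Navigation in Python Scrapers, Using Weibo as an Example (CSDN blog)
Scraping Amazon Product Data with Python, Multithreaded (Source Code Included) (CSDN blog)

Original-work statement: this technical approach is a summary of hands-on experience; please credit the source when reposting. Data scraping may violate platform policies, so use it with caution! For more technical details, follow the author's CSDN technical column.

Disclaimer: the scraping techniques in this article are for learning and exchange only and must not be used for any commercial or illegal purpose. In practice, comply with the Cybersecurity Law and the relevant platform rules.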
