Basic Usage of BeautifulSoup


About the author: Hi everyone, I'm hacker707; you can call me hacker.
Home page: hacker707's CSDN blog
Series: Python web scraping


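bs4

Installing bs4

Quick start with bs4
Parser comparison (for reference only)
Object types

Simple usage of bs4

Traversing the document tree

Practice example
Approach
Implementation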

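Installing bs4

To use BeautifulSoup4 with the lxml parser, as this article does, install lxml first and then bs4 (lxml itself is optional; Python's built-in html.parser also works).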

pip install lxml

pip install bs4

Usage:

from bs4 import BeautifulSoup

Comparing lxml and bs4

from lxml import etree
tree = etree.HTML(html)
tree.xpath('//a/@href')  # query the tree with an XPath expression

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

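Note:
If you create the soup object without passing 'lxml' or features="lxml" (or another parser name), Beautiful Soup emits a warning that no parser was explicitly specified and falls back to the best parser it can find on the system.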
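As a minimal sketch of the note above, naming the parser explicitly keeps the constructor quiet. The sample markup is made up for illustration, and html.parser is used here only because it ships with Python; 'lxml' is passed the same way once it is installed.

```python
import warnings

from bs4 import BeautifulSoup

markup = "<p>Hello</p>"  # hypothetical sample markup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The parser is named explicitly, so no "no parser specified" warning is expected
    soup = BeautifulSoup(markup, features="html.parser")

# Collect any warning whose message mentions the parser
parser_warnings = [w for w in caught if "parser" in str(w.message).lower()]
print(soup.p.string)         # Hello
print(len(parser_warnings))  # 0
```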

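Quick start with bs4

Parser comparison (for reference only)

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Bundled with Python; moderate speed	Poor tolerance of malformed documents (in versions before Python 2.7.3 or 3.2.2)
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml')	Fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, 'html5lib')	Best tolerance of malformed documents; parses a document the way a browser does; produces HTML5-format documents	Slow; requires an external Python package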
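To make the "tolerance" column concrete, here is a small sketch that feeds the same invalid fragment to different parsers. The fragment is a made-up example; lxml and html5lib are optional installs, so the loop simply reports a parser that is missing instead of assuming it is present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: a closing </p> that was never opened

# html.parser ignores the stray </p> and closes the <a> tag
builtin_result = str(BeautifulSoup(broken, "html.parser"))
print(builtin_result)  # <a></a>

# The other parsers repair the fragment differently; skip any that is not installed
for parser in ("lxml", "html5lib"):
    try:
        print(parser, "->", BeautifulSoup(broken, parser))
    except FeatureNotFound:
        print(parser, "is not installed")
```

Because each parser repairs bad markup in its own way, code that runs on several machines is more predictable when the parser is always named explicitly.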

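Object types

Tag: a tag
BeautifulSoup: the bs object (the parsed document as a whole)
NavigableString: a navigable string (the text inside a tag)
Comment: a comment

from bs4 import BeautifulSoup

# String simulating an HTML document.
# The <span> holding an HTML comment is an assumption: the markup of this listing
# was lost, and soup.span.string below needs a tag that contains a comment.
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!-- a comment --></span>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup.span.string))  # <class 'bs4.element.Comment'>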

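Simple usage of bs4

Getting tag contents

from bs4 import BeautifulSoup

# String simulating an HTML document
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)  # print the head tag
print('body tag:\n', soup.body)  # print the body tag
print('html tag:\n', soup.html)  # print the html tag
print('p tag:\n', soup.p)  # print the p tag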

Note: soup.p prints only the first p tag. To get every p tag, use find_all:

print('p tags:\n', soup.find_all('p'))
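The difference between the attribute shortcut and find_all can be sketched on a tiny document. The markup below is a made-up example, and html.parser is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")                   # first match only, same tag as soup.p
all_p = soup.find_all("p")               # list-like result with every <p>
first_two = soup.find_all("p", limit=2)  # stop searching after two matches

print(first.string)                # one
print([p.string for p in all_p])   # ['one', 'two', 'three']
print(len(first_two))              # 2
```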

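Note that the tag name is passed to find_all as a string here.
Getting tag names
Use the name attribute to get a tag's name.

from bs4 import BeautifulSoup

# String simulating an HTML document
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)  # print the name of the head tag
print('body tag name:\n', soup.body.name)  # print the name of the body tag
print('html tag name:\n', soup.html.name)  # print the name of the html tag
# find_all returns a list of tags, so read name from each tag rather than from the list
print('p tag names:\n', [p.name for p in soup.find_all('p')])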

To match two different tags, pass a list filter rather than separate strings.
Passing the names as two string arguments returns an empty list, because find_all treats the second positional argument as a CSS class to match, not as a second tag name:

print(soup.find_all('title', 'p'))

[]

Use a list filter to get the contents of several tags:

print(soup.find_all(['title', 'p']))

[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]
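The sketch below shows what the second positional argument actually does, alongside the list filter and two other filter kinds that find_all accepts (a regular expression and a function). The markup is a shortened, made-up stand-in for the sample document, parsed with html.parser so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

html = ('<html><head><title>Demo</title></head><body>'
        '<p class="title"><b>Heading</b></p>'
        '<p class="story">Text</p></body></html>')  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

# The second positional argument is matched against the class attribute
story = soup.find_all("p", "story")
print(story)  # [<p class="story">Text</p>]

# List filter: any tag whose name appears in the list
both = [t.name for t in soup.find_all(["title", "b"])]
print(both)  # ['title', 'b']

# Regular-expression filter: tag names that start with "b"
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
print(starts_with_b)  # ['body', 'b']

# Function filter: tags that carry a class attribute
with_class = [t["class"] for t in soup.find_all(lambda t: t.has_attr("class"))]
print(with_class)  # [['title'], ['story']]
```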

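Getting the href attribute of a tags

from bs4 import BeautifulSoup

# String simulating an HTML document
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')

# Loop over the list and read the attribute value
for a in a_list:
    # Method 1: get() returns the href value, or None if the attribute is missing
    print(a.get('href'))
    # Method 2: attrs holds all attributes as a dict; pick out the one you want
    print(a.attrs['href'])
    # Method 3: index the tag directly; a missing attribute raises KeyError
    print(a['href'])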
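The behaviour of the three access styles on a missing attribute can be sketched with a single tag. The `title` attribute looked up here is deliberately absent, and the "n/a" fallback is an arbitrary illustrative value.

```python
from bs4 import BeautifulSoup

html = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

print(a.get("href"))          # http://example.com/elsie
print(a.get("title"))         # None, the attribute is absent
print(a.get("title", "n/a"))  # n/a, the fallback passed as second argument
print(a["class"])             # ['sister'], class is multi-valued so it is a list

try:
    a["title"]  # indexing a missing attribute raises KeyError
except KeyError as exc:
    missing = str(exc)
    print("missing attribute:", missing)
```

Using get() is the safer choice when scraping pages where some links may lack the attribute.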

Extra: prettify() indents the output so that the nesting of nodes is easier to see and analyse.

print(soup.prettify())

Output without prettify

<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Output with prettify

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

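Traversing the document tree

from bs4 import BeautifulSoup

# String simulating an HTML document
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

head = soup.head
# contents returns a list of all direct children: [<title>The Dormouse's story</title>]
print(head.contents)
# children returns an iterator over the direct children
print(head.children)
# Any iterator can be looped over
for h in head.children:
    print(h)

html = soup.html  # newlines are matched as child nodes too
# descendants returns a generator that walks children, grandchildren, and so on
print(html.descendants)
# Any generator can be looped over
for h in html.descendants:
    print(h)

# Key points to master:
# string            gets the content inside a tag
# strings           returns a generator, used to get the content of several tags
# stripped_strings  much like strings, but strips the surrounding whitespace
print(soup.title.string)
print(soup.html.string)  # None, because <html> has more than one child
# strings returns a generator; all text inside the html tag is included
print(soup.html.strings)
for h in soup.html.strings:
    print(h)
# stripped_strings also returns a generator, with extra whitespace removed
print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)

# parent   gets the direct parent node
# parents  gets all ancestor nodes
title = soup.title
# parent finds the direct parent
print(title.parent)
# parents returns a generator over all ancestors
print(title.parents)
for p in title.parents:
    print(p)
# The parent of html is the whole document
print(soup.html.parent)
# print(type(soup.html.parent))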
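The listing above prints a lot of output, so here is a compact sketch of the same attributes on a one-line, made-up fragment, where each result is small enough to show in a comment. It uses html.parser so that no wrapper html or body tags are added.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b>world</b> </p>", "html.parser")

b_string = soup.b.string                      # single child, so its text is returned
p_string = soup.p.string                      # several children, so None
p_strings = list(soup.p.strings)              # every text node, whitespace kept
p_stripped = list(soup.p.stripped_strings)    # stripped, whitespace-only nodes dropped
b_parents = [t.name for t in soup.b.parents]  # ancestors, innermost first

print(b_string)    # world
print(p_string)    # None
print(p_strings)   # [' Hello ', 'world', ' ']
print(p_stripped)  # ['Hello', 'world']
print(b_parents)   # ['p', '[document]']
```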

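Practice example

Get all the job titles

# The markup of this listing was lost; it is rebuilt here as a plain table
# (tr > td > a) without attributes, following the approach described below.
# Columns: job title, category, openings, location, posting date
html = """
<table>
<tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""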

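Approach

The data we want sits in the a tag inside each tr node, so it is enough to loop over all tr nodes and read the text of the a tag in each one.

Implementation

from bs4 import BeautifulSoup

# The markup of this listing was lost; it is rebuilt here as a plain table
# (tr > td > a) without attributes, matching the approach described above.
# Columns: job title, category, openings, location, posting date
html = """
<table>
<tr><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr><td><a>22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐运营开发工程师（深圳）</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>SNG16-腾讯音乐业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr><td><a>TEG03-高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG03-高级图像算法研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>TEG11-高级AI开发工程师（深圳）</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr><td><a>SNG11-高级业务运维工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""

# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Find every tr node with find_all(); the first tr is the table header, so skip it
tr_list = soup.find_all('tr')[1:]
# Loop over tr_list and read the text of the a tag in each row
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)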

The output is:

22989-金融云区块链高级研发工程师（深圳）
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师（深圳）
SNG16-腾讯音乐业务运维工程师（深圳）
TEG03-高级研发工程师（深圳）
TEG03-高级图像算法研发工程师（深圳）
TEG11-高级AI开发工程师（深圳）
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师（深圳）
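The same loop extends naturally from one column to whole rows. The sketch below is one possible variation, not the author's code: it uses a shortened, made-up table with English placeholder values and turns each data row into a dict keyed by the header cells.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the job table
html = """
<table>
<tr><td>Title</td><td>Category</td><td>Openings</td></tr>
<tr><td><a>Backend Developer</a></td><td>Tech</td><td>2</td></tr>
<tr><td><a>Ops Engineer</a></td><td>Tech</td><td>1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
# The first row holds the column names
headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]

jobs = []
for tr in rows[1:]:
    # get_text() also collects text nested inside the <a> tag
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    jobs.append(dict(zip(headers, cells)))

print(jobs[0])
# {'Title': 'Backend Developer', 'Category': 'Tech', 'Openings': '2'}
```

get_text() is used instead of string here because it works whether or not a cell wraps its text in a nested tag.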

That covers the basic usage of bs4. Suggestions for improvement are welcome in the comments.
