beautifulsoup4 Basics

Fur1na

2023-09-15

bs4

Installation

pip3 install beautifulsoup4

Importing the package

from bs4 import BeautifulSoup

Parsing text into a BeautifulSoup object

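Parsers

Passing html.parser as the second argument tells bs4 to parse the text as HTML. You can leave it out and bs4 will pick a suitable parser on its own, but it is better to name one explicitly, because parsers differ from one another in how they handle the same markup (https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/#id53).

soup = BeautifulSoup(html_doc, 'html.parser')

bs4 also supports third-party parsers. A common one is lxml, which has to be installed separately.

pip3 install lxml

html.parser is the HTML parser from the Python standard library. It is best used on Python 3.2.2 or later, since compatibility on older versions is poor. lxml is fast and tolerant of malformed markup, but needs a separate install. Both work on recent Python versions, with lxml being the faster of the two.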
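One way to act on this advice is to prefer lxml when it is installed and fall back to html.parser otherwise. The sketch below is only an illustration: make_soup is a hypothetical helper name, not part of bs4, and it relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the standard-library parser instead.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Keep in mind that the two parsers can build slightly different trees for broken markup, so a fallback like this suits well-formed input best.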

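Kinds of objects

Using the parser, bs4 turns a piece of text into a tree structure in which every part is an object. These objects fall into four main types: Tag, NavigableString, BeautifulSoup, and Comment.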

Tag

Name

Use .name to get the name of a Tag.

Attributes

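A Tag may have any number of attributes, which you can get through .attrs. Attributes are handled like dictionary entries: they can be added, removed, and modified.

# Add or modify
tag['class'] = 'verybold'
# Delete
del tag['id']

Some attributes can hold multiple values, class for example. For these bs4 returns a list.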
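To see the name and attribute operations together, here is a small sketch. The one-line markup is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="story strikeout">...</p>', 'html.parser')
tag = soup.p

print(tag.name)         # p
print(tag['id'])        # intro
print(tag['class'])     # ['story', 'strikeout'] (multi-valued, so a list)
print(tag.attrs)        # {'id': 'intro', 'class': ['story', 'strikeout']}
print(tag.get('href'))  # None, because .get() does not raise on a missing attribute

tag['id'] = 'ending'    # modify an attribute
del tag['class']        # delete an attribute
print(tag)              # <p id="ending">...</p>
```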

NavigableString

Strings usually sit inside a tag. The title is a common example.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""

Get the text of the title paragraph.

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('p')
print(tag.string)
# The Dormouse's story
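The value returned by .string is a NavigableString rather than a plain str. The sketch below, using made-up markup, shows its type, how to convert it, and the case where .string gives None because the tag has more than one child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', 'html.parser')

text = soup.b.string
print(type(text).__name__)  # NavigableString
print(str(text))            # The Dormouse's story

# <p> has a single child (<b>), so .string reaches through to the same string.
print(soup.p.string)        # The Dormouse's story

# With several children, .string is None; get_text() joins all the text instead.
mixed = BeautifulSoup('<p>Once <b>upon</b> a time</p>', 'html.parser')
print(mixed.p.string)       # None
print(mixed.p.get_text())   # Once upon a time
```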

BeautifulSoup

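A BeautifulSoup object represents the whole content of a document. For most purposes it can be treated like a Tag: it supports most of the methods for navigating and searching the tree.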

Comment

Comment handles the comments in a document.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
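A typical use is collecting or stripping every comment in a document. This sketch assumes the common idiom of passing a function as the string filter. The markup is invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup('<p>Visible text<!-- hidden note --></p>', 'html.parser')

# Keep only the strings that are Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])  # [' hidden note ']

# extract() removes each comment from the tree.
for comment in comments:
    comment.extract()
print(soup)  # <p>Visible text</p>
```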

Navigating the tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story strikeout">...</p>
"""

Child nodes

The most common way to reach a tag's children is to use the name of the tag you want.

Find the head tag:

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('head'))
# <head><title>The Dormouse's story</title></head>

You can also get it directly with dot notation.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)
# <head><title>The Dormouse's story</title></head>

Likewise, get the p tag:

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p)

This approach only returns the first matching tag. To get all of them, use find_all():

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('p'))

This returns a list.

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story strikeout">...</p>]

A tag also has two attributes for reaching its children: .contents and .children.

.contents returns the tag's direct children as a list.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.contents)
# [<b>The Dormouse's story</b>]

.children lets you iterate over the child nodes.

soup = BeautifulSoup(html_doc, 'html.parser')
for child in soup.p.children:
    print(child)
    # <b>The Dormouse's story</b>
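Children are not always tags: the text between tags counts as a child too. The sketch below uses a shortened, made-up paragraph to show this, and filters the children down to tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>',
    'html.parser',
)

# .contents is a list, so it supports len() and indexing.
contents = soup.p.contents
print(len(contents))  # 4: 'Names: ', <a>, ', ', <a>
print(contents[0])    # Names:

# .children is an iterator over the same nodes; keep only the Tag objects.
child_tags = [child['id'] for child in soup.p.children if isinstance(child, Tag)]
print(child_tags)     # ['link1', 'link2']
```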

Searching the tree

Filters

Use a string to locate the b tag:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

Use a regular expression to locate tags whose names start with b:

import re

soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

Search for tags whose names contain t.

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

Use a list to find several tags at once:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

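Use True to match every tag:

for tag in soup.find_all(True):
    print(tag.name)

If none of the filters above fits, you can define your own function (it must take exactly one argument). This one matches tags that have a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all to get every tag that satisfies the condition.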

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story strikeout">...</p>]
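Here is the function filter as a complete script. The markup is a cut-down, made-up variant of the sample document, kept short so the result is easy to check.

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<p class="story strikeout">...</p>'
)
soup = BeautifulSoup(markup, 'html.parser')


def has_class_but_no_id(tag):
    # bs4 calls this once per tag; return True to keep the tag.
    return tag.has_attr('class') and not tag.has_attr('id')


matches = soup.find_all(has_class_but_no_id)
summary = [(tag.name, tag['class']) for tag in matches]
print(summary)  # [('p', ['title']), ('p', ['story', 'strikeout'])]
```

The a tag is skipped because it has an id, and the b tag is skipped because it has no class.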

find_all()

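The find_all method searches all of a tag's descendants and checks each one against the filter.

name

The name argument works the same way as the filters above: it accepts a string, a regular expression, a list, and so on.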

keyword

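If you pass a keyword argument whose name is not one of find_all's built-in parameters, bs4 treats it as a filter on the tag attribute of that name.

id is not a built-in parameter, so the code below finds the tags whose id attribute is link2.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Likewise for href:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Whatever the attribute, its value can be any of the filters described earlier, and several keyword arguments can be combined to filter on more than one attribute at once.

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes cannot be searched this way, such as data-*, because find_all(data-foo="value") is not valid Python syntax. For these, pass a dictionary through attrs instead.

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]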

You can also narrow the search by restricting attributes. Here we find the input tag whose id is title_zh and read its value.

soup.find('input', {'id': 'title_zh'})['value'].strip()
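The one-liner above raises an error if the input is missing or has no value attribute. A slightly more defensive sketch follows. The form markup is hypothetical, since the page being scraped is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical form: one input with a padded value, one without a value.
form_html = (
    '<form>'
    '<input id="title_zh" value="  Alice in Wonderland  ">'
    '<input id="title_en">'
    '</form>'
)
soup = BeautifulSoup(form_html, 'html.parser')

# find() returns None when nothing matches, so check before indexing.
field = soup.find('input', {'id': 'title_zh'})
title = field['value'].strip() if field is not None else None
print(title)  # Alice in Wonderland

# .get() returns None instead of raising KeyError for a missing attribute.
other = soup.find('input', attrs={'id': 'title_en'})
print(other.get('value'))  # None
```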

string

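The string argument searches the strings in the document instead of the tags. It accepts the same filters, such as a list or a regular expression.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=re.compile("Dormouse")))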
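On its own, string returns the matching strings. Combined with a tag name, it returns the tags whose .string matches. A short sketch with made-up markup:

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>',
    'html.parser',
)

# string alone: the result holds strings, not tags.
strings = soup.find_all(string=re.compile("^L"))
print([str(s) for s in strings])  # ['Lacie']

# name plus string: the result holds the <a> tags whose text matches.
tags = soup.find_all("a", string="Elsie")
print(tags[0]['id'])  # link1
```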

limit

Limit the number of results returned:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

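find_all() is the most commonly used search method in bs4, so it has a shorthand that makes it more convenient to use: calling the object directly. The two lines in each pair below are equivalent.

soup.find_all("a")
soup("a")

soup.title.find_all(string=True)
soup.title(string=True)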
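A quick way to convince yourself of the equivalence is to compare the two results directly. The markup is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>',
    'html.parser',
)

# Calling a soup or tag object is the same as calling find_all() on it.
via_method = soup.find_all("a")
via_call = soup("a")
print(via_call == via_method)  # True

# The shorthand takes the same arguments, such as limit.
first_only = soup.p("a", limit=1)
print(first_only)  # [<a id="link1">Elsie</a>]
```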

find()

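When you only want one result, the two calls below match the same element, so you can simply use find. The return values differ, though: find_all returns a list, which is empty when nothing is found, while find returns the element itself, or None when nothing is found.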

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
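Because find can return None, chaining straight off its result is a common source of AttributeError. The sketch below shows the two "not found" results and one way to guard against them. The table tag is just an arbitrary name that is absent from the markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser',
)

print(soup.find_all('table'))  # []
print(soup.find('table'))      # None

# Compare against None explicitly before using the result.
table = soup.find('table')
caption = table.get_text() if table is not None else 'no table found'
print(caption)  # no table found

title = soup.find('title')
print(title.string if title is not None else 'no title found')  # The Dormouse's story
```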