Python Modules in Detail | BeautifulSoup

2021-05-19

tec

python

Preface

Why BeautifulSoup?

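Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.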

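Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You normally do not need to think about encodings. The exception is a document that declares no encoding and whose encoding Beautiful Soup cannot detect on its own; in that case you only need to state the original encoding.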
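
As a rough illustration of the encoding behaviour described above, the sketch below feeds GBK-encoded bytes to Beautiful Soup and states the source encoding explicitly through `from_encoding`. The sample markup and variable names are made up for this example, and `html.parser` is used only so that no third-party parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: GBK-encoded bytes with no <meta charset> declaration.
raw = "<html><body><p>你好，世界</p></body></html>".encode("gbk")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # already a Unicode str
detected = soup.original_encoding   # encoding that was used to decode the input
utf8_bytes = soup.encode("utf-8")   # serialise the tree back to UTF-8 bytes

print(text)
print(detected)
print(utf8_bytes.decode("utf-8"))
```

When the input is already a `str` (for example `r.text` from requests), the decoding has happened before Beautiful Soup sees it, so `from_encoding` only matters for byte input.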

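Beautiful Soup sits on top of well-known Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.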

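Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, it falls back to Python's built-in html.parser. The lxml parser is more capable and faster, so lxml is the recommended choice.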
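
One possible way to act on that recommendation without making lxml a hard dependency is sketched below. The helper name `make_soup` is hypothetical; the sketch assumes that requesting a parser that is not installed raises `FeatureNotFound`, which `bs4` exports.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Prefer lxml; fall back to the standard-library parser if lxml is absent."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: both parsers close them when building the tree.
soup = make_soup("<p>Hello <b>world")
print(soup.b.get_text())
print(soup.p.get_text())
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from invalid markup.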

Usage

from bs4 import BeautifulSoup

1. Three ways to create a BeautifulSoup object

From a string - soup = BeautifulSoup(html_str, 'lxml')
From a file - soup = BeautifulSoup(open('index.html'), 'lxml')

From the network - soup = BeautifulSoup(requests.get(url).text, 'lxml')

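import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.baidu.com')
r.encoding = 'utf-8'  # tell requests how to decode the response body
soup = BeautifulSoup(r.text, 'lxml')  # name the parser explicitly
print(soup.prettify())  # pretty-printed output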
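
The string and file variants can be tried offline as well. The following is a minimal sketch: the markup, the temporary `index.html`, and the variable names are invented for the example, and `html.parser` stands in for lxml so nothing extra needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_str = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

# 1) From a string
soup_from_str = BeautifulSoup(html_str, "html.parser")

# 2) From a file handle; the with-blocks make sure the file is closed
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.title.string)
print(soup_from_file.title.string)
```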

2. Searching

Beautiful Soup offers many methods for finding elements; only the most frequently used ones are listed here:

find(name, attrs, recursive, text)
- :param name: A filter on tag name.
- :param attrs: A dictionary of filters on attribute values.
- :param recursive: If this is True, find() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.
- :param text: A filter on the text content of a tag (named string in Beautiful Soup 4.4.0 and later).
- :kwargs: A dictionary of filters on attribute values.
- :return: The first matching PageElement, or None if nothing matches.
```
import requests
from bs4 import BeautifulSoup
url = "https://www.dytt8.net/index.htm"
r = requests.get(url)
r.encoding = 'gbk'  # the page is served in GBK
doc = r.text
soup = BeautifulSoup(doc, 'lxml')
# First <div> whose class is "bd3r", or None if there is no such element
contents = soup.find("div", attrs={"class": "bd3r"})
```
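
For experimenting without a network request, here is a small sketch of `find()` on inline markup. The HTML is invented for the example (only the class name `bd3r` is borrowed from the snippet above), and `html.parser` is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div class="bd3r"><a href="/a.html">first link</a></div>
<div class="other"><a href="/b.html">second link</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to filter on the class attribute
box = soup.find("div", attrs={"class": "bd3r"})
same_box = soup.find("div", class_="bd3r")

# find() returns None when nothing matches, so check before chaining calls
missing = soup.find("table")

# find() can be called again on a result to search only inside it
link = box.find("a")
print(link["href"], link.get_text())
print(missing)
```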

find_all(name, attrs, recursive, text, limit)

import requests
from bs4 import BeautifulSoup
url = "https://www.dytt8.net/index.htm"
r = requests.get(url)
r.encoding = 'gbk'  # the page is served in GBK
doc = r.text
soup = BeautifulSoup(doc, 'lxml')
# Every <td> in the document, as a list (empty if there are none)
contents = soup.find_all("td")
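
The extra arguments of `find_all()` are easier to see on a small table. This is only a sketch on made-up markup; it uses `html.parser`, which leaves the table structure exactly as written.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<table>
  <tr><td class="name">A</td><td>1</td></tr>
  <tr><td class="name">B</td><td>2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

all_cells = soup.find_all("td")                   # every <td> in the document
names = soup.find_all("td", class_="name")        # filter on an attribute
first_two = soup.find_all("td", limit=2)          # stop after two matches
# recursive=False looks at direct children only; the <td> tags sit inside <tr>
direct_only = soup.table.find_all("td", recursive=False)

print(len(all_cells))
print([td.get_text() for td in names])
print([td.get_text() for td in first_two])
print(direct_only)
```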

select(selector, namespaces=None, limit=None, **kwargs)

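In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to filter elements here through soup.select(), which returns a list. Examples: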

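soup.select("div[class='first']") - find div elements whose class attribute is "first"
soup.select("div p") - find all p elements that are descendants of a div
soup.select("div > p") - find all p elements that are direct children of a div
soup.select("div ~ p") - find all sibling p elements that come after a div
soup.select("div + p") - find the p element that immediately follows a div as its next sibling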

import requests
from bs4 import BeautifulSoup

url = "https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"
doc = requests.get(url).text
soup = BeautifulSoup(doc, "lxml")
# Header cells of the ranking table
No = soup.select("div[class='data'] table thead tr th[class='th-01']")[0].text
keyword = soup.select("div[class='data'] table thead tr th[class='th-02']")[0].text
# All rows in the table body
trs = soup.select("div[class='data'] table tbody tr")
nameList = []
for tr in trs[1:]:  # skip the first row
    ranktop = tr.select("td[class='td-01 ranktop']")[0].text
    name = tr.select("td[class='td-02'] a")[0].text
    href = "https://s.weibo.com" + tr.select("td[class='td-02'] a")[0].get("href")  # or index with ["href"]
    num = tr.select("td[class='td-02'] span")[0].text  # view count
    nameList.append(name)

print(nameList[:5])
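
The selector forms listed above can also be checked offline. The sketch below runs each of them against a small invented document; the `ids` helper and the element ids are hypothetical and exist only to make the matches easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<body>
<div class="first">
  <p id="a">direct child of div</p>
  <section><p id="b">nested deeper inside div</p></section>
</div>
<p id="c">first sibling after div</p>
<p id="d">second sibling after div</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Return the id attribute of every element matched by the selector."""
    return [tag["id"] for tag in soup.select(selector)]


first_div = soup.select("div[class='first']")
results = {
    "div p": ids("div p"),      # descendants at any depth
    "div > p": ids("div > p"),  # direct children only
    "div ~ p": ids("div ~ p"),  # all later siblings
    "div + p": ids("div + p"),  # the immediately following sibling
}

for selector, matched in results.items():
    print(selector, matched)
```

When only the first match is needed, `soup.select_one(selector)` avoids the `[0]` indexing and returns None instead of raising an IndexError when nothing matches.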

3. Navigating element nodes:

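tag.parent - the parent of the current node; the parent of the root tag is the document node (the BeautifulSoup object), and the parent of the document node is None
tag.children - all direct children of the current node
tag.descendants - all descendants of the current node
tag.next_sibling - the next sibling of the current node
tag.previous_sibling - the previous sibling of the current node
tag.next_siblings - all siblings that come after the current node
tag.previous_siblings - all siblings that come before the current node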
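
A short sketch of these attributes on an invented list follows. The markup deliberately has no whitespace between tags: with indented HTML, the whitespace becomes text nodes, and `next_sibling` would often return a newline string rather than the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags, so siblings are tags.
html = "<html><body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
first, second, third = ul.find_all("li")

parent_name = second.parent.name                       # the enclosing <ul>
children = [child.get_text() for child in ul.children]
descendants = list(ul.descendants)                     # tags and their text nodes
next_text = second.next_sibling.get_text()
prev_text = second.previous_sibling.get_text()
after_first = [s.get_text() for s in first.next_siblings]
before_third = [s.get_text() for s in third.previous_siblings]  # nearest first
root_parent = soup.html.parent.name                    # the document node

print(parent_name, children, len(descendants))
print(prev_text, next_text)
print(after_first, before_third)
print(root_parent, soup.parent)
```

Note that `children`, `descendants`, `next_siblings`, and `previous_siblings` are iterators, so they are wrapped in a list or a comprehension here before being printed.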

