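Comparing the speed of common Python scraping parsers: lxml, pyquery, bs4 and re

Python has many parsing libraries for web scraping. This post benchmarks four of them: lxml, bs4, re and pyquery.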

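bs4: a document-tree parsing library written in pure Python. It supports four parsers (html.parser, lxml, lxml-xml and html5lib); this test uses lxml. Elements can be located by tag name or by CSS selector.
pyquery: a Python document-tree parsing library modeled on the front-end jQuery library. It feels very much like jQuery and locates elements with CSS syntax.
lxml: a parsing library built on the C libraries libxml2 and libxslt and called from Python. Elements are located with XPath syntax.
re: Python's regular expression library.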
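To make the four styles concrete, here is a minimal sketch that pulls the same h2 titles out of a tiny, made-up HTML string with each library. The `HTML` sample, the `titles_*` helper names and the `EXTRACTORS` table are assumptions for illustration, not part of the benchmark script. bs4 is paired with html.parser here only so that the example does not require lxml, and any library that is not installed is skipped.

```python
import importlib.util
import re

# Tiny hypothetical document used only for this illustration
HTML = "<html><body><h2 class='t'>First</h2><h2 class='t'>Second</h2></body></html>"


def titles_re(html):
    # Regular expression: match the text between <h2 ...> and </h2>
    return re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.S)


def titles_lxml(html):
    # lxml: locate elements with XPath
    from lxml import etree
    return [str(h2.xpath("string(.)")) for h2 in etree.HTML(html).xpath("//h2")]


def titles_bs4(html):
    # bs4: locate elements by tag name
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text() for h2 in soup.find_all("h2")]


def titles_pyquery(html):
    # pyquery: locate elements with a jQuery-style CSS selector
    from pyquery import PyQuery as pq
    return [h2.text() for h2 in pq(html)("h2").items()]


# (module that must be importable, extractor function)
EXTRACTORS = {
    "re": (None, titles_re),
    "lxml": ("lxml", titles_lxml),
    "bs4": ("bs4", titles_bs4),
    "pyquery": ("pyquery", titles_pyquery),
}

results = {}
for name, (module, func) in EXTRACTORS.items():
    # Skip third-party libraries that are not installed
    if module is None or importlib.util.find_spec(module) is not None:
        results[name] = func(HTML)

for name, titles in results.items():
    print(name, titles)
```

Every extractor that runs should report the same two titles; only the way the elements are located differs.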
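Each of the four has its strengths and weaknesses:
bs4 tends to be reserved for extracting the text of script tags, because it is simply too slow for heavier parsing.
re matches patterns in unstructured text.
lxml is implemented in C underneath, so its speed is beyond question, and it is also easy to use.
pyquery is more flexible to use than XPath or bs4: a PyQuery object can be built directly from an HTML file, from a URL (PyQuery fetches the page itself), or from a document string.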
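As an illustration of re working on unstructured text, the sketch below pulls a number out of a line of script source, the kind of text a document-tree parser hands back as one opaque string. The `SCRIPT_TEXT` sample, the `PRICE_RE` pattern and the `extract_price` helper are hypothetical and do not come from the page used in the benchmark.

```python
import re

# Hypothetical text content of a <script> tag; not taken from a real page
SCRIPT_TEXT = 'var itemId = "B00EXAMPLE"; var price = 129.00; var stock = 3;'

# Match "var price = <number>" with optional whitespace around the "="
PRICE_RE = re.compile(r"var\s+price\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def extract_price(text):
    """Return the price as a float, or None when the pattern is absent."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else None


print(extract_price(SCRIPT_TEXT))
print(extract_price("no price here"))
```

Returning None for a miss keeps the caller from crashing when the page layout changes, which is the usual failure mode of pattern-based extraction.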

The code is as follows:

"""
@Author: Jonescyna
@Created: 2020/12/28
"""

import requests
import time
import re
from pyquery import PyQuery as pq
from lxml import etree
from bs4 import BeautifulSoup


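def cal_time(func):
    """Print how long each call to the wrapped function takes."""
    def inner(*args, **kwargs):
        # perf_counter is a high-resolution clock meant for timing;
        # time.time() can be too coarse for very short runs on Windows
        start = time.perf_counter()
        ret = func(*args, **kwargs)
        print(f'{func.__name__}:{time.perf_counter() - start}s')
        return ret

    return inner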


base_url = 'https://www.amazon.cn/b/ref=s9_acss_bw_cg_pccateg_2a1_w?node=106200071&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=PQNKPPABQXAWCTZSNFXA&pf_rd_t=101&pf_rd_p=cdcd9a0d-d7cf-4dab-80db-2b7d63266973&pf_rd_i=42689071'


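def get(url):
    # Send a browser-like User-Agent instead of the default requests one
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'}

    # The timeout keeps the script from hanging on a stalled connection
    resp = requests.get(url, headers=headers, timeout=10)
    return resp.text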


@cal_time
def parse_by_pq(html):
    for _ in range(50):
        doc = pq(html)

        h2_list = doc('h2').items()
        for h2 in h2_list:
            h2.text()


@cal_time
def parse_by_xpath(html):
    for _ in range(50):
        doc = etree.HTML(html)
        h2_list = doc.xpath('//h2')
        for h2 in h2_list:
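            # string(.) returns the full text of the element; indexing
            # './text()' with [0] raises IndexError when an h2 has no direct text node
            title = h2.xpath('string(.)')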


@cal_time
def parse_by_bs4(html):
    for _ in range(50):
        soup = BeautifulSoup(html, 'lxml')
        h2_list = soup.find_all('h2')
        for h2 in h2_list:
            title = h2.text


@cal_time
def parse_by_re(html):
    for _ in range(50):
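        # This pattern is tied to the page's markup: it expects the title
        # on its own line right after an <h2 ...> opening tag
        h2_list = re.findall(r'<h2 .*>\n(.*)\n<', html)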
        for h2 in h2_list:
            title = h2


if __name__ == '__main__':
    resp = get(base_url)
    parse_by_pq(resp)
    parse_by_xpath(resp)
    parse_by_bs4(resp)
    parse_by_re(resp)
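The decorator above times one run of each function. As a complement, here is a minimal sketch that uses the standard library's timeit.repeat to take several measurements and report the best and the median, which makes run-to-run noise visible. The `make_html` and `bench` helpers are hypothetical, and the synthetic page stands in for the downloaded one so that the sketch runs without network access; only the re variant is shown, but any of the parse functions could be passed to `bench` in the same way.

```python
import re
import statistics
import timeit


def make_html(n_items=200):
    """Build a synthetic page with n_items h2 titles, one title per line."""
    rows = "\n".join(
        f'<h2 class="title">\nItem {i}\n</h2>' for i in range(n_items)
    )
    return f"<html><body>\n{rows}\n</body></html>"


def parse_by_re(html):
    # Same idea as the script's regex: the title sits on its own line
    return re.findall(r"<h2 [^>]*>\n(.*)\n<", html)


def bench(func, html, loops=50, repeats=5):
    """Time `loops` calls of func(html), repeated `repeats` times."""
    timings = timeit.repeat(lambda: func(html), number=loops, repeat=repeats)
    return {"best": min(timings), "median": statistics.median(timings)}


html = make_html()
stats = bench(parse_by_re, html)
print(f"parse_by_re: best={stats['best']:.6f}s median={stats['median']:.6f}s")
```

Reporting the best of several repeats reduces the influence of other processes on the machine, while the median shows what a typical run looks like.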

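Results

Test environment: I ran the test on a desktop PC running Windows 10 with an i5 CPU and 16 GB of DDR3 memory. Each figure below is the total time for 50 parsing passes over the same page. The page is downloaded once, outside the timed functions, so the numbers depend on the machine rather than on network speed; in the same environment the timings vary little from run to run.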

parse_by_pq:0.9650003910064697s
parse_by_xpath:0.761019229888916s
parse_by_bs4:2.878000020980835s
parse_by_re:0.01597905158996582s
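To put the figures above in proportion, the short sketch below divides each timing by the lxml (XPath) timing and converts the totals into an average per parsing pass. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Total seconds for 50 parsing passes, copied from the output above
timings = {
    "parse_by_pq": 0.9650003910064697,
    "parse_by_xpath": 0.761019229888916,
    "parse_by_bs4": 2.878000020980835,
    "parse_by_re": 0.01597905158996582,
}
PASSES = 50

baseline = timings["parse_by_xpath"]

# How many times slower (or faster, if below 1) each run is than lxml
relative = {name: t / baseline for name, t in timings.items()}

# Average milliseconds per parsing pass
per_pass_ms = {name: t / PASSES * 1000 for name, t in timings.items()}

for name in timings:
    print(f"{name}: {relative[name]:.2f}x lxml, {per_pass_ms[name]:.2f} ms per pass")
```

On these numbers bs4 takes roughly 3.8 times as long as lxml, pyquery about 1.3 times as long, and the regular expression finishes in about one forty-eighth of the lxml time. The re figure is small enough to be near the resolution of time.time() on some systems, so it is best read as an order of magnitude.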
