Python has many parsing libraries for web scraping. This test compares four of them: lxml, bs4, re, and pyquery.
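- bs4: a document-tree parsing library written in pure Python. It supports four parsers (html.parser, lxml, lxml-xml, and html5lib); this test uses lxml. Elements can be located by tag or by CSS selector.
- pyquery: a Python document-tree parsing library modeled on the front-end jQuery library. It feels very similar to jQuery and locates elements with CSS selector syntax.
- lxml: a parsing library written in C and called from Python. Elements are located with XPath syntax.
- re: Python's regular expression library.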
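Each of the four libraries has its own strengths and weaknesses:
- bs4 is mostly used to parse the text inside script tags, because it is very slow.
- re is for matching against unstructured documents.
- lxml is implemented in C underneath, so its speed is not in question, and it is also easy to use.
- pyquery is more flexible than XPath and bs4: a PyQuery object can be built directly from an HTML file, a URL (fetched with requests if it is installed, otherwise urllib), or a document string.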
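To make the difference in locating syntax concrete, here is a minimal sketch that extracts the same titles four ways. `SAMPLE_HTML` and the `titles_by_*` function names are made up for illustration and are not part of the benchmark; the third-party libraries are imported inside each function only so that the sketch still loads when some of them are not installed.

```python
import re

# Hypothetical two-item page, used only for illustration.
SAMPLE_HTML = """
<html><body>
<div class="item"><h2 class="title">First item</h2></div>
<div class="item"><h2 class="title">Second item</h2></div>
</body></html>
"""


def titles_by_re(html):
    # Regex: non-greedy match of everything between <h2 ...> and </h2>.
    return re.findall(r'<h2[^>]*>(.*?)</h2>', html, flags=re.S)


def titles_by_xpath(html):
    from lxml import etree
    # XPath: select the text nodes directly under every <h2>.
    doc = etree.HTML(html)
    return [str(text) for text in doc.xpath('//h2/text()')]


def titles_by_bs4(html):
    from bs4 import BeautifulSoup
    # bs4: CSS selector via select(); find_all('h2') would locate by tag.
    soup = BeautifulSoup(html, 'lxml')
    return [h2.get_text() for h2 in soup.select('h2.title')]


def titles_by_pyquery(html):
    from pyquery import PyQuery as pq
    # pyquery: jQuery-style CSS selector, then .text() on each match.
    doc = pq(html)
    return [h2.text() for h2 in doc('h2.title').items()]


print(titles_by_re(SAMPLE_HTML))
```

On a well-formed snippet like this one, all four functions should return the same two titles; the differences show up in speed and in how well each approach copes with messy real-world HTML.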
The code is as follows:
"""
@Author: Jonescyna
@Created: 2020/12/28
"""
import functools
import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree
from pyquery import PyQuery as pq


def cal_time(func):
    """Decorator that prints how long each call to func takes."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        start = time.perf_counter()  # high-resolution timer for benchmarking
        ret = func(*args, **kwargs)
        print(f'{func.__name__}:{time.perf_counter() - start}s')
        return ret
    return inner


base_url = 'https://www.amazon.cn/b/ref=s9_acss_bw_cg_pccateg_2a1_w?node=106200071&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=PQNKPPABQXAWCTZSNFXA&pf_rd_t=101&pf_rd_p=cdcd9a0d-d7cf-4dab-80db-2b7d63266973&pf_rd_i=42689071'


def get(url):
    """Download the page once so that every parser works on the same HTML."""
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail early instead of timing an error page
    return resp.text
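

# Each function parses the same HTML 50 times and reads the text of every <h2>.
@cal_time
def parse_by_pq(html):
    for _ in range(50):
        doc = pq(html)
        h2_list = doc('h2').items()
        for h2 in h2_list:
            h2.text()


@cal_time
def parse_by_xpath(html):
    for _ in range(50):
        doc = etree.HTML(html)
        h2_list = doc.xpath('//h2')
        for h2 in h2_list:
            # An <h2> without a direct text node would raise IndexError on [0]
            texts = h2.xpath('./text()')
            title = texts[0] if texts else ''


@cal_time
def parse_by_bs4(html):
    for _ in range(50):
        soup = BeautifulSoup(html, 'lxml')
        h2_list = soup.find_all('h2')
        for h2 in h2_list:
            title = h2.text


@cal_time
def parse_by_re(html):
    for _ in range(50):
        # Tied to this page's layout: the title sits on its own line
        # between the opening <h2 ...> tag and the next tag
        h2_list = re.findall(r'<h2 .*>\n(.*)\n<', html)
        for h2 in h2_list:
            title = h2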


if __name__ == '__main__':
    # Fetch the page once; only the parsing functions are timed
    html = get(base_url)
    parse_by_pq(html)
    parse_by_xpath(html)
    parse_by_bs4(html)
    parse_by_re(html)
Results
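Test environment: I ran the test on a desktop PC with Windows 10, an i5 CPU, and 16 GB of DDR3 memory. Timings vary from machine to machine. The network only affects the download, which is not included in the timings. In the same environment the timings do not fluctuate much.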
parse_by_pq:0.9650003910064697s
parse_by_xpath:0.761019229888916s
parse_by_bs4:2.878000020980835s
parse_by_re:0.01597905158996582s
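To check how much the timings fluctuate on a given machine, one option is to repeat each measurement several times with the standard `timeit` module and compare the minimum, median, and maximum. The sketch below does this for the regex parser only. `SAMPLE_HTML` and the `benchmark` helper are assumptions made up for illustration, standing in for the downloaded page and for the `cal_time` decorator; the same helper could be pointed at the other three parsers.

```python
import re
import statistics
import timeit

# Hypothetical stand-in for the downloaded page: 200 small <h2> blocks.
SAMPLE_HTML = "\n".join(
    f'<div><h2 class="title">\nItem {i}\n</h2></div>' for i in range(200)
)

# Same pattern as parse_by_re in the benchmark, compiled once.
H2_PATTERN = re.compile(r'<h2 .*>\n(.*)\n<')


def parse_by_re(html):
    """Return the text of every <h2> that matches the pattern."""
    return H2_PATTERN.findall(html)


def benchmark(func, html, loops=50, repeats=5):
    """Time `loops` calls of func(html), repeated `repeats` times."""
    samples = timeit.repeat(lambda: func(html), number=loops, repeat=repeats)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


stats = benchmark(parse_by_re, SAMPLE_HTML)
print(f"parse_by_re: min={stats['min']:.6f}s "
      f"median={stats['median']:.6f}s max={stats['max']:.6f}s")
```

A small gap between the minimum and the maximum suggests the numbers are stable; a large gap usually means something else on the machine was competing for CPU time during the run.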