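This scrape targets the text-only section of Qiushibaike (糗百). Its format is fairly uniform, so there is no need to detect images or videos. The scraper uses only the standard library, and the data is extracted with regular expressions.
Setting the request headers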
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}
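As a minimal sketch of how these headers get attached to a request: the post itself uses Python 2's urllib2, while the snippet below assumes Python 3's urllib.request, and build_request is a hypothetical helper name. Nothing is sent over the network; it only constructs the request object.

```python
from urllib.request import Request

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


req = build_request('http://www.qiushibaike.com/text/page/1/?s=4984889')
# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(req.get_header('User-agent'))
```

Sending a browser-like User-Agent is a common way to avoid being rejected as an obvious script; whether the site still requires it is not something this sketch checks.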
Pagination
for page in range(1,5):
    url = 'http://www.qiushibaike.com/text/page/'+str(page)+'/?s=4984889'
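Note that range(1,5) yields pages 1 through 4, not 5. The sketch below makes that explicit by building the list of URLs up front. The page_urls helper and its parameter names are hypothetical; the URL pattern and the s=4984889 value are taken from the loop above.

```python
def page_urls(first=1, last=4, token='4984889'):
    """Build the listing URLs for pages first..last (inclusive)."""
    base = 'http://www.qiushibaike.com/text/page/'
    return [base + str(page) + '/?s=' + token
            for page in range(first, last + 1)]


for url in page_urls():
    print(url)
```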
Regular expression
Each (.*?) is a non-greedy capture group that extracts one field
pattern = re.compile('<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>.*?<div class="content">.*?<span>(.*?)</span>'+
'.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<i class="number">(.*?)</i>',re.S)
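The pattern captures 6 fields: author, gender, age, joke text, vote count and comment count. Regex is grubby work and easy to get wrong, but it makes good practice.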
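To see what the six groups capture, here is a small sketch that runs the same pattern against a made-up HTML fragment. The SAMPLE markup is an assumption shaped to fit the pattern, not a copy of the real page, and the snippet is written for Python 3.

```python
import re

PATTERN = re.compile(
    '<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>'
    '.*?<div class="content">.*?<span>(.*?)</span>'
    '.*?<span class="stats-vote"><i class="number">(.*?)</i>'
    '.*?<i class="number">(.*?)</i>',
    re.S)  # re.S lets '.' match newlines, so one match can span many lines

# Hypothetical fragment; the real page contains much more markup.
SAMPLE = '''
<h2>Alice</h2>
<div class="articleGender womenIcon">25</div>
<div class="content">
<span>A short joke.</span>
</div>
<span class="stats-vote"><i class="number">120</i> funny</span>
<span class="stats-comments"><i class="number">7</i> comments</span>
'''

items = PATTERN.findall(SAMPLE)
for author, gender, age, text, votes, comments in items:
    print(author, gender, age, text, votes, comments)
```

Because every gap is bridged with .*?, an entry that lacks one of the expected tags can cause a match to stretch into the next entry, which is one reason this approach is error-prone.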
Full code
# -*- coding:utf-8 -*-
import urllib2
import re

user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}

# Compile once: author, gender, age, joke text, vote count, comment count
pattern = re.compile('<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>.*?<div class="content">.*?<span>(.*?)</span>'+
                     '.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<i class="number">(.*?)</i>',re.S)

for page in range(1,5):
    url = 'http://www.qiushibaike.com/text/page/'+str(page)+'/?s=4984889'
    try:
        # Fetch the page source
        request = urllib2.Request(url,headers = headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        # Match the regex against the page
        items = re.findall(pattern,content)
        for item in items:
            print u"Page %s\nAuthor: %s\tGender: %s\tAge: %s\nJoke: %s\nFunny votes: %s\tComments: %s" % (page,item[0],item[1],item[2],item[3],item[4],item[5])
    except urllib2.URLError as e:
        # HTTPError carries a status code; a plain URLError only has a reason
        if hasattr(e,"code"):
            print e.code
        if hasattr(e,"reason"):
            print e.reason
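The hasattr checks in the except branch exist because an HTTP-level failure carries a status code, while a lower-level failure (DNS, refused connection) only carries a reason. The sketch below illustrates that distinction under Python 3, where the same classes live in urllib.error; describe_error is a hypothetical helper and the errors are constructed by hand rather than raised by a real request.

```python
from urllib.error import HTTPError, URLError


def describe_error(err):
    """Summarize an error using whichever of code/reason it exposes."""
    parts = []
    if hasattr(err, 'code'):
        parts.append('code=%s' % err.code)
    if hasattr(err, 'reason'):
        parts.append('reason=%s' % err.reason)
    return ', '.join(parts)


# A plain URLError has only a reason.
print(describe_error(URLError('connection refused')))
# An HTTPError (a subclass of URLError) has both a code and a reason.
print(describe_error(HTTPError('http://example.com/', 404, 'Not Found', {}, None)))
```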
Output
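Another drawback of regex is that it easily drags stray markup such as <br/> tags into the captured text, so the results still need cleaning.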
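One possible cleaning step, as a sketch: it assumes the stray markup is mostly <br/> line breaks plus the occasional other tag, which is a guess about the page rather than something the post confirms. clean_content is a hypothetical helper, written for Python 3.

```python
import re


def clean_content(raw):
    """Turn <br> variants into newlines, drop other tags, trim whitespace."""
    text = re.sub(r'<br\s*/?>', '\n', raw, flags=re.I)
    text = re.sub(r'<[^>]+>', '', text)  # remove any remaining tags
    return text.strip()


print(clean_content('\n\nline one<br/>line two<br>line three\n'))
```

In the full script this would be applied to the fourth captured field (the joke text) before printing.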