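Python Web Scraping in Practice: Scraping the Full Text of the Novel 《斗破苍穹》 (Using the re Module)

Goal

Scrape the full text of the novel 《斗破苍穹》 from http://www.doupoxs.com/doupocangqiong/

Approach

Browse the first few chapters by hand and watch how the URL changes. These are the URLs of the first 4 chapters: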

http://www.doupoxs.com/doupocangqiong/2.html

http://www.doupoxs.com/doupocangqiong/5.html

http://www.doupoxs.com/doupocangqiong/6.html

http://www.doupoxs.com/doupocangqiong/7.html

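As you can see, there is no obvious pattern between chapters 1 and 2, but from chapter 2 onward the pattern is clear: the page number simply increases by one. If you manually enter http://www.doupoxs.com/doupocangqiong/3.html, you get a 404 error page.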

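So the approach is: build the URLs by counting up from the page number of chapter 1, and skip any URL along the way that returns a 404 error.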
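The idea can be tried offline before hitting the site. This is only a minimal sketch: `build_chapter_urls`, `keep_existing`, and `fake_status` are hypothetical helper names, not part of the original script, and the fake status function assumes that pages 3 and 4 are both missing (only page 3 was actually checked by hand above).

```python
BASE = "http://www.doupoxs.com/doupocangqiong/{}.html"


def build_chapter_urls(first_id, last_id):
    """Build one URL per page number, from first_id to last_id inclusive."""
    return [BASE.format(i) for i in range(first_id, last_id + 1)]


def keep_existing(urls, get_status):
    """Keep only the URLs whose status code is 200.

    get_status is passed in so the logic can be exercised without network access.
    """
    return [url for url in urls if get_status(url) == 200]


def fake_status(url):
    # Assumption for illustration: pages 3 and 4 do not exist
    if url.endswith(("/3.html", "/4.html")):
        return 404
    return 200


urls = build_chapter_urls(2, 7)
existing = keep_existing(urls, fake_status)
for url in existing:
    print(url)
```

In the real crawler the status check and the download are the same request, so there is no need for a separate filtering pass; the sketch only isolates the skip logic.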

The full code is as follows:

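import re
import time

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

# Open the output file in append mode (created if missing).
# The encoding is set explicitly so Chinese text is written the same way on every platform.
f = open('doupo.txt', 'a+', encoding='utf-8')


def getInfo(url):
    r = requests.get(url, headers=headers)
    if r.status_code == 200:  # only parse pages that were fetched successfully
        # Extract the contents of every <p> tag with a regex.
        # r.content is bytes, so it must be decoded before a str pattern can be applied.
        contents = re.findall("<p>(.*?)</p>", r.content.decode('UTF-8'), re.S)
        for content in contents:
            f.write(content + '\n')
    # Any other status code (for example 404) is skipped.


if __name__ == '__main__':
    urls = ["http://www.doupoxs.com/doupocangqiong/{}.html"
            .format(i) for i in range(2, 1665)]
    for url in urls:
        print('Scraping ' + url)
        getInfo(url)
        time.sleep(2)  # pause between requests to go easy on the server

    f.close()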
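To see what the regular expression in `getInfo` does, here is a small offline sketch. The HTML in `sample_html` is made up for illustration and is not taken from the site; it only shows why the non-greedy `(.*?)` and the `re.S` flag matter when a paragraph spans several lines.

```python
import re

# Made-up HTML fragment; the first paragraph contains a line break
sample_html = """<div>
<p>First paragraph,
continued on a second line.</p>
<p>Second paragraph.</p>
</div>"""

# With re.S, "." also matches newlines, so the multi-line paragraph is captured
with_dotall = re.findall(r"<p>(.*?)</p>", sample_html, re.S)

# Without re.S, "." stops at a newline, so the multi-line paragraph is missed
without_dotall = re.findall(r"<p>(.*?)</p>", sample_html)

print(len(with_dotall))     # 2
print(len(without_dotall))  # 1
```

The non-greedy `.*?` stops at the first `</p>` it meets, so each paragraph becomes its own match rather than one match running from the first `<p>` to the last `</p>`.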
