Over the three-day Mid-Autumn Festival holiday I spent two days learning web scraping. I had picked up some basics from teacher Haotian's MOOC course, and Douban's Top 250 movies page is widely considered an easy site to scrape, so I wrote my own version with reference to the many hands-on examples out there. Writing it myself meant stepping into pit after pit, but by combining the best parts of each example I finally got a semi-finished script I'm fairly happy with, which I'm posting on Jianshu first.
When writing a scraper, get the framework right first: split fetching the page, parsing the page, storing the data, and checking whether the target folder and file exist into separate functions. This makes it much easier to tackle each piece on its own, to debug, and to swap parts out for reuse.
The page-fetching part uses teacher Haotian's general-purpose framework, which I strongly recommend.
Of the libraries used: requests pulls down the page content, bs4 parses it, os checks for, deletes, and creates files and folders, and time adds a delay to reduce the load on the server.
# import libraries
import requests
from bs4 import BeautifulSoup
import os
import time

# get html text (the general-purpose fetch framework)
def fGetHtmlText(vUrl):
    try:
        vHeaders = {"user-agent": "Mozilla/5.0", "Host": "movie.douban.com"}
        r = requests.get(vUrl, headers=vHeaders)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("There is something wrong")
        return ""

# parse the html text with bs4
def fSoup(vHtml):
    vSoup = BeautifulSoup(vHtml, "html.parser")
    vLis = vSoup.select("ol li")
    for vLi in vLis:
        # index
        vIndex = vLi.find("em").text
        # title
        vTitle = vLi.find("span", class_="title").text
        # comment: a few entries have no quote span, so guard against None,
        # and strip the illegal "\u22ef" character
        vInq = vLi.find("span", class_="inq")
        vComment = vInq.text.replace("\u22ef", "") if vInq else ""
        # rating number
        vRatingNum = vLi.find("span", class_="rating_num").text
        fSaveToFile(vIndex, vTitle, vComment, vRatingNum)

# create the folder if it is missing, and remove any old data file
def fJudgeFile():
    if not os.path.exists("F:\\PythonData"):
        os.mkdir("F:\\PythonData")
    if os.path.exists("F:\\PythonData\\douban.csv"):
        os.remove("F:\\PythonData\\douban.csv")

# save data
def fSaveToFile(index, title, comment, ratingNum):
    # "with" guarantees the file is closed even on error; the original
    # f.closed only read an attribute and never actually closed the file
    with open("F:\\PythonData\\douban.csv", "a", encoding="utf-8-sig") as f:
        f.write(f"{index}, {title}, {comment}, {ratingNum}\n")

# main function
def main(vUrl):
    vHtml = fGetHtmlText(vUrl)
    fSoup(vHtml)

# check the data file, then crawl all 10 pages (25 movies per page)
fJudgeFile()
for i in range(10):
    # each page advances the start parameter by 25
    vUrl = "http://movie.douban.com/top250?start=" + str(25 * i)
    print("***Crawling page " + str(i + 1) + "***")
    main(vUrl)
    time.sleep(2)
print("*****Crawling finished*****")
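One weak spot in the save step: the short movie quotes often contain commas themselves, so writing them into a comma-separated file with a plain f-string shifts the columns. A sketch of a safer variant using the standard csv module, which quotes such fields automatically (the function name, demo path, and sample row below are my own, not from the original script):

```python
import csv
import os
import tempfile

def save_row_csv(path, index, title, comment, rating_num):
    # csv.writer quotes any field that contains a comma, so a quote like
    # "希望, 让人自由" stays in one column instead of splitting into two
    with open(path, "a", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerow([index, title, comment, rating_num])

# demo: write one record to a temporary file
demo_path = os.path.join(tempfile.gettempdir(), "douban_demo.csv")
if os.path.exists(demo_path):
    os.remove(demo_path)
save_row_csv(demo_path, "1", "肖申克的救赎", "希望, 让人自由", "9.7")
```

Opening with `newline=""` is what the csv docs recommend on Windows to avoid blank lines between rows, and `utf-8-sig` lets Excel detect the encoding of the Chinese titles.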