Python crawler 06: analyzing Ajax requests to scrape Toutiao street-snap photos and store them in MongoDB

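Yesterday I learned how to scrape dynamically loaded content by analyzing Ajax requests, so today I'll share the results.
Ajax is a technique, not a language. JavaScript in the page sends asynchronous requests to the server through XMLHttpRequest (the data was originally XML, these days it is mostly JSON) and then updates the page with the response, so the content changes while the page URL stays the same.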

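More and more sites work this way, and front-end/back-end separation is a major trend in web development. As a result, the page source we get back from requests may contain little more than an empty <body></body> tag, with the whole page rendered by JavaScript. That makes scraping harder.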

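When analyzing Ajax requests, pay attention to the parameters that get sent. If they are too complex to work out, skip the Ajax analysis and use Selenium together with ChromeDriver to grab the fully rendered page directly: what you see is what you get.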

I tried Weibo, but there were too many parameters and I couldn't find the pattern.

So today I'll use Toutiao street snaps (街拍) as the example for scraping via Ajax analysis.

First open Toutiao, type 街拍 into the search box, and press Enter:

image.png

That takes you to this page:

image.png

Next open the developer tools, click the Network tab, select the XHR filter, refresh the page, and keep scrolling down. You will see something like this:

image.png

Click the first entry to see the request headers, the response, and other details of that request:

image.png

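Look at the Request URL: this is the URL the page requests as we scroll down. Click the next few entries and you will find that only offset and timestamp change, while the other parameters stay the same. offset is the paging offset and grows by 20 each time. timestamp is the current Unix time in milliseconds, that is, our computer's clock value multiplied by 1000 and truncated to an integer:
int(time.time() * 1000)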
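
As a quick check of that timestamp rule, here is a minimal sketch. The helper name current_millis is my own and is not part of Toutiao's API.

```python
import time

def current_millis():
    """Return the current Unix time in whole milliseconds."""
    return int(time.time() * 1000)

ts = current_millis()
print(ts, len(str(ts)))  # a 13-digit integer for present-day dates
```

int() already truncates the fractional part, so the extra //1 is not needed.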

So we can construct the parameters for requesting one page:

    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time() * 1000)  # current time in milliseconds
              }

Then we encode them with urlencode() from urllib.parse, append them to the base URL to form the request URL, request the page, and return the parsed response:

def get_page(offset):
    '''Fetch one page of Toutiao search results as parsed JSON (None on failure).'''
    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time() * 1000)  # current time in milliseconds
              }
    headers = {
        'Accept': 'application/json, text/javascript',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-Hans-CN, zh-Hans; q=0.5',
        'Cache-Control': 'max-age=0',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'www.toutiao.com',
        'Referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
        'X-Requested-With': 'XMLHttpRequest'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url=base_url+urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print("error", e.args)
    # Non-200 status or connection error: tell the caller nothing was fetched
    return None
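
To see what urlencode() contributes here, the sketch below encodes a cut-down parameter set (only keyword and offset, picked for illustration) and builds the final URL without sending any request. The helper name build_url is an assumption of mine.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/?'

def build_url(params):
    """Append percent-encoded query parameters to the base URL."""
    return BASE_URL + urlencode(params)

url = build_url({'keyword': '街拍', 'offset': 20})
print(url)
# https://www.toutiao.com/api/search/content/?keyword=%E8%A1%97%E6%8B%8D&offset=20
```

The encoded keyword matches the one in the Referer header above. requests can also do this encoding itself when the dictionary is passed through its params argument.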

Next we parse the JSON we got back. In the Preview tab we can see that the article titles and street-snap images we want are all under data:

image.png

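The title field is the headline we want, and the images are in image_list. There may be only one or two images per item, but we won't go into each article to look for more; we just save these few. That gives us the following parse function: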

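def parse_page(result):
    # get_page() returns None on failure, so guard before calling .get()
    if result and result.get('data'):
        for item in result.get('data'):
            try:
                title = item.get('title')
                images = item.get('image_list')
            except AttributeError:
                # item is not a dict; skip it
                continue
            # Skip entries that lack a title or images
            if title is None or images is None:
                continue
            for image in images:
                yield {
                    'title': title,
                    'image': image.get('url')
                }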

This returns a generator, which saves memory and is convenient to use.
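
To try the parsing logic without touching the network, the sketch below feeds a hand-made dictionary, shaped like the response described above, into a compact equivalent of the generator. The function name iter_images, the sample titles, and the example.com URLs are all invented for illustration; real responses carry many more fields.

```python
def iter_images(result):
    """Yield one {'title', 'image'} dict per image in a search result."""
    for item in (result or {}).get('data') or []:
        title = item.get('title')
        images = item.get('image_list')
        if title is None or images is None:
            continue  # entry lacks a title or images; skip it
        for image in images:
            yield {'title': title, 'image': image.get('url')}

sample = {
    'data': [
        {'title': 'A', 'image_list': [{'url': 'http://example.com/1.jpg'},
                                      {'url': 'http://example.com/2.jpg'}]},
        {'title': None, 'image_list': []},  # no title: skipped
        {'title': 'B'},                     # no image_list: skipped
    ]
}

records = list(iter_images(sample))
print(records)
print(list(iter_images(None)))  # a failed request yields nothing
```

Because it is a generator, nothing is parsed until the caller starts iterating, and each record can be handled as soon as it is produced.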

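Now that we have the title and the image URLs, we can start saving images. I use the MD5 hash of the image content as the file name, which removes duplicates (with this few images you could skip that). After that, each record is saved to MongoDB, which is quite pleasant to use.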

def save_img(item):
    title = item.get('title')
    image = item.get('image')
    # exist_ok avoids a race when several worker processes create the same folder
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(image)
        if response.status_code == 200:
            # Name the file after the MD5 of its content so duplicates collapse
            file_path = "{0}/{1}.{2}".format(title,
                                             md5(response.content).hexdigest(),
                                             'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print("Image already exists", file_path)
    except requests.ConnectionError:
        print("Failed to save image")
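
The MD5 naming trick can be checked offline. The sketch below writes raw bytes into a temporary directory and shows that identical content maps to the same file name, so the second write is skipped. save_bytes and safe_name are hypothetical helpers of mine; safe_name also replaces characters that are not allowed in Windows folder names, something the function above does not handle.

```python
import os
import re
import tempfile
from hashlib import md5

def safe_name(title):
    """Replace characters that are invalid in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

def save_bytes(root, title, content):
    """Save content under root/title/<md5>.jpg; return (path, written)."""
    folder = os.path.join(root, safe_name(title))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if os.path.exists(path):
        return path, False  # same content was already saved
    with open(path, 'wb') as f:
        f.write(content)
    return path, True

with tempfile.TemporaryDirectory() as root:
    first = save_bytes(root, 'street: snap?', b'fake image bytes')
    second = save_bytes(root, 'street: snap?', b'fake image bytes')
    print(first[1], second[1])  # True False
    saved_files = os.listdir(os.path.join(root, safe_name('street: snap?')))
```

Deduplication here is per folder: the same image under two different titles would still be stored twice.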

Saving to the database:

def insert_into_mongodb(item, collection):
    '''Insert the dict item into the given collection.'''
    result = collection.insert_one(item)
    print(result)

The main() function takes an offset, then fetches the page, parses it, saves the images, and writes the records to the database:

def main(offset):
    print("main",offset)
    client = MongoClient('mongodb://localhost:27017')
    db = client.toutiao
    collection = db.jiepai
    json = get_page(offset)
    for item in parse_page(json):
        print(item)
        save_img(item)
        insert_into_mongodb(item,collection)

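This time I finally got multiprocessing working, using a process pool, Pool(). It was not entirely smooth: running the multiprocessing code inside PyCharm hung, but running it from cmd (that is, double-clicking the file) worked fine, which is odd. The multiprocessing code also has to sit under
if __name__ == '__main__': to run correctly, because on Windows each worker process re-imports the main module: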

GROUP_START = 0
GROUP_STOP = 20
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)])
    print(group)
    pool.map(main, group)
    pool.close()
    pool.join()
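
The offsets handed to the pool are plain multiples of 20. The sketch below builds the same list and maps a stand-in worker over it, so the pattern can be run without network or database access. build_offsets and fake_main are placeholders of mine; fake_main stands in for main().

```python
from multiprocessing import Pool, freeze_support

GROUP_START = 0
GROUP_STOP = 20

def build_offsets(start, stop, step=20):
    """Return the offset of every page from start to stop, inclusive."""
    return [x * step for x in range(start, stop + 1)]

def fake_main(offset):
    """Stand-in for main(): just report which page would be fetched."""
    return offset // 20

if __name__ == '__main__':
    freeze_support()  # only matters for frozen Windows executables
    offsets = build_offsets(GROUP_START, GROUP_STOP)
    with Pool(processes=4) as pool:
        pages = pool.map(fake_main, offsets)
    print(len(offsets), offsets[:3], offsets[-1])
    print(pages[:3])
```

With GROUP_STOP = 20 the range is inclusive, so 21 pages (offsets 0 through 400) are requested. pool.map returns results in the same order as the inputs, even though the workers run in parallel.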

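Here is the output. Because the cmd window closes as soon as the script finishes, I added an input() call to keep it open until I close it myself.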

image.png

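In the end it still closed right away, though... I had just run the script once before this, so the images show up as duplicates here. If you run it tomorrow it should go smoothly.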

Here are the scraped results, 155 records in total over two runs, yesterday and today:

Folders:

image.png

The first picture turned out to be Zhu Yilong...
Database:

image.png

All in all, a success!

Keep it up!

The complete code is given below:

import os
import requests
import json
from pymongo import MongoClient
from hashlib import md5
from multiprocessing import Pool
from multiprocessing import freeze_support
from urllib.parse import urlencode
import time
def get_page(offset):
    '''Fetch one page of Toutiao search results as parsed JSON (None on failure).'''
    params = {'aid': '24',
              'app_name': 'web_search',
              'offset': offset,
              'format': 'json',
              'keyword': '街拍',
              'autoload': 'true',
              'count': '20',
              'en_qc': '1',
              'cur_tab': '1',
              'from': 'search_tab',
              'pd': 'synthesis',
              'timestamp': int(time.time() * 1000)  # current time in milliseconds
              }
    headers = {
        'Accept': 'application/json, text/javascript',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-Hans-CN, zh-Hans; q=0.5',
        'Cache-Control': 'max-age=0',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'www.toutiao.com',
        'Referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
        'X-Requested-With': 'XMLHttpRequest'
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url=base_url+urlencode(params)

    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print("error", e.args)
    # Non-200 status or connection error: tell the caller nothing was fetched
    return None

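def parse_page(result):
    # get_page() returns None on failure, so guard before calling .get()
    if result and result.get('data'):
        for item in result.get('data'):
            try:
                title = item.get('title')
                images = item.get('image_list')
            except AttributeError:
                # item is not a dict; skip it
                continue
            # Skip entries that lack a title or images
            if title is None or images is None:
                continue
            for image in images:
                yield {
                    'title': title,
                    'image': image.get('url')
                }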

def save_img(item):
    title = item.get('title')
    image = item.get('image')
    # exist_ok avoids a race when several worker processes create the same folder
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(image)
        if response.status_code == 200:
            # Name the file after the MD5 of its content so duplicates collapse
            file_path = "{0}/{1}.{2}".format(title,
                                             md5(response.content).hexdigest(),
                                             'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print("Image already exists", file_path)
    except requests.ConnectionError:
        print("Failed to save image")

def insert_into_mongodb(item, collection):
    result = collection.insert_one(item)
    print(result)

def main(offset):
    print("main",offset)
    client = MongoClient('mongodb://localhost:27017')
    db = client.toutiao
    collection = db.jiepai
    json = get_page(offset)
    for item in parse_page(json):
        print(item)
        save_img(item)
        insert_into_mongodb(item,collection)



GROUP_START = 0
GROUP_STOP = 20
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    group = ([x*20 for x in range(GROUP_START, GROUP_STOP+1)])
    print(group)
    pool.map(main, group)
    pool.close()
    pool.join()
    input()

Before running it, make sure the required libraries (requests and pymongo) are installed, along with MongoDB and a GUI client for it.

I'm happy with how this crawler turned out: the functions are loosely coupled, so the code is easy to maintain!