Minimal Scrapy Spider 3: Crawling Multi-Level Pages

Environment:
* Python 2.7.12  
* Scrapy 1.2.2
* Mac OS X 10.10.3 Yosemite

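This part keeps crawling the practice site provided by the Scrapy 1.2.2 documentation:

"http://quotes.toscrape.com"

The site is meant for beginner practice, so for now there is no need to worry about the spider being blocked.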

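Goal

Scrape the profile of every author.

Step 1: Locate the link with scrapy shell

Analyze the page structure

Every author has a dedicated detail page.

The first task is to find the URL that leads to it. Looking at the HTML of a single quote: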

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by 
          <small class="author" itemprop="author">Albert Einstein</small>
          <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">             
            <a class="tag" href="/tag/change/page/1/">change</a>            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>            
            <a class="tag" href="/tag/world/page/1/">world</a>           
        </div>
</div>

the link to the author's page turns out to be in:

        <a href="/author/Albert-Einstein">(about)</a>

Test the selector

The scrapy shell command line can be used to check whether the link to the author's page can be located. In a terminal, run

$ scrapy shell 'http://quotes.toscrape.com'

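Quoting the URL is recommended. On Windows, use double quotes.

Method 1: CSS adjacent sibling selector

With the CSS adjacent sibling combinator, the link is located by small.author+a::attr(href).

small.author: matches the small tag whose class is author
"+": means the immediately following sibling tag
+a::attr(href): matches the a tag right after it and extracts its href attribute.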

>>> author = response.css('small.author+a::attr(href)').extract_first()
>>> author
u'/author/Albert-Einstein'

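Method 2: XPath selecting the second span tag

//div[@class="quote"]/span[2]/a/@href means:

//div[@class="quote"]: start from the div tag whose class is quote
/span[2]: select the second span child of that div
/a/@href: select the a tag inside that span and take its href attribute.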

>>> au = response.xpath('//div[@class="quote"]/span[2]/a/@href').extract_first() 
>>> au
u'/author/Albert-Einstein'

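Method 3: XPath selecting the adjacent sibling tag

In XPath, "//" selects matching nodes anywhere in the document. In //small[@class="author"]/following-sibling::a[1]/@href:

//small[@class="author"]: matches the small tag whose class is author
/following-sibling::a[1]: the first a tag among the siblings that follow that small tag at the same level
/@href: extracts the href attribute.

>>> auth = response.xpath('//small[@class="author"]/following-sibling::a[1]/@href').extract_first()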
>>> auth
u'/author/Albert-Einstein'

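Step 2: Write and run the code

Rename the spider: name = 'quotes_2_3'

In the parse() function of the previous spider, add three lines inside the loop over the quotes so that each author's page is crawled as well.

            author_page = quote.css('small.author+a::attr(href)').extract_first()
            author_full_url = response.urljoin(author_page)
            yield scrapy.Request(author_full_url, callback=self.parse_author)

These lines find the link to the author's page, join it into an absolute URL, and then issue a request (scrapy.Request) whose callback parse_author() parses the author's page. The selector is called on quote rather than on response, so each iteration picks up the link that belongs to the current quote instead of the first link on the page.

The parse_author() function is structured much like the parsing of an ordinary page.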
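
To see what the joining step produces, here is a small sketch that uses only the standard library. It assumes a page without a <base> tag, in which case response.urljoin() should give the same result as urljoin() applied to the URL of the response; the page_url value is just an example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Example URL of the page currently being parsed (an assumption for this sketch).
page_url = 'http://quotes.toscrape.com/page/2/'

# Relative links as they appear in the page's href attributes.
author_href = '/author/Albert-Einstein'
next_href = '/page/3/'

author_full_url = urljoin(page_url, author_href)
next_full_url = urljoin(page_url, next_href)

print(author_full_url)
print(next_full_url)
```

Because both href values start with "/", they are resolved against the site root and not against the current page path.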

    def parse_author(self, response):
        # Each field is read from the element carrying the matching class name.
        yield {
            'author': response.css('.author-title::text').extract_first(),
            'author_born_date': response.css('.author-born-date::text').extract_first(),
            'author_born_location': response.css('.author-born-location::text').extract_first(),
            'author_description': response.css('.author-description::text').extract_first(),
        }

The complete code is as follows:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes_2_3'
    start_urls = [
        'http://quotes.toscrape.com',
    ]
    allowed_domains = [
        'toscrape.com',
    ]

    def parse(self,response):
        for quote in response.css('div.quote'):
            yield{
                'quote': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
            # Use quote (not response) so the link belongs to the current quote.
            author_page = quote.css('small.author+a::attr(href)').extract_first()
            author_full_url = response.urljoin(author_page)
            yield scrapy.Request(author_full_url, callback=self.parse_author)
            
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_full_url = response.urljoin(next_page)
            yield scrapy.Request(next_full_url, callback=self.parse)

    def parse_author(self,response):
        yield{
            'author': response.css('.author-title::text').extract_first(),
            'author_born_date': response.css('.author-born-date::text').extract_first(),
            'author_born_location': response.css('.author-born-location::text').extract_first(),
            'author_description': response.css('.author-description::text').extract_first(),
        }

To run the spider, use

$ scrapy crawl quotes_2_3 -o results_2_3_01.json

which writes the results to a JSON file.
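
The output file mixes two kinds of items: quotes from parse() and author profiles from parse_author(). A minimal sketch for separating them afterwards, using only the standard library, could look like this. The helper split_items is hypothetical, and the sample data is made up so the snippet runs without the real file.

```python
import json


def split_items(items):
    """Separate quote items from author items by the presence of the 'quote' key."""
    quotes = [item for item in items if 'quote' in item]
    authors = [item for item in items if 'quote' not in item]
    return quotes, authors


# Made-up sample in the same shape as the spider's output.
sample = json.loads("""
[
  {"quote": "q1", "author": "A", "tags": ["t1"]},
  {"quote": "q2", "author": "A", "tags": []},
  {"author": "A", "author_born_date": "d", "author_born_location": "l"}
]
""")

quotes, authors = split_items(sample)
print('%d quotes, %d authors' % (len(quotes), len(authors)))

# With the real file, load it first:
# with open('results_2_3_01.json') as f:
#     quotes, authors = split_items(json.load(f))
```

Scrapy filters duplicate requests by default, so an author who is quoted several times should still appear only once among the author items.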