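Background

Recently Her Majesty, needing it for work, ordered me to back up the content of a well-known website for study and research. Having taken on the mission at short notice, I started digging in at 23:40 and finished at 00:20. Under an hour is a decent result, so here is a summary of what I learned.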

Learning resources

https://github.com/TobiaszCudnik/phpquery phpQuery official tutorial
http://www.fkphp.com/?p=49 phpQuery manual in Chinese

Process

Step 1: Analyze the source pages

First, find the list pages

http://sousuo.gov.cn/column/30469/0.htm
http://sousuo.gov.cn/column/30469/1.htm
……
http://sousuo.gov.cn/column/30469/56.htm

The URLs of all the list pages to crawl follow a regular pattern, so first store them in list_page.txt, one per line.
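Because the pattern is regular, the file does not have to be typed by hand. A minimal sketch, assuming the pages run from 0.htm to 56.htm as listed above; the helper name build_list_urls is hypothetical and not part of the original script:

```php
<?php
// Hypothetical helper: build the list-page URLs for pages $first..$last.
function build_list_urls(string $base, int $first, int $last): array
{
    $urls = [];
    for ($page = $first; $page <= $last; $page++) {
        $urls[] = "{$base}/{$page}.htm";
    }
    return $urls;
}

// Assumption: the column has pages 0 through 56, as in the listing above.
$urls = build_list_urls('http://sousuo.gov.cn/column/30469', 0, 56);

// One URL per line, which is the format the crawler script expects.
file_put_contents('list_page.txt', implode("\n", $urls) . "\n");
```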

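Then, analyze the list page content

The list pages are static HTML. They are neither encrypted nor rendered dynamically with jQuery, so there is no need to analyze APIs or parse JavaScript.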

<?php
// Download the phpQuery package into a local phpQuery directory first.
require('phpQuery/phpQuery/phpQuery.php');

// list_page.txt holds one list-page URL per line.
$content = file_get_contents("list_page.txt");
$arr = explode("\n", $content);
foreach ($arr as $n => $one) {
    $one = trim($one);
    if (strlen($one) == 0) continue;
    // Load the list page; pq() then queries this document.
    phpQuery::newDocumentFile($one);
    foreach (pq(".listTxt a") as $m => $a) {
        $href = (string) pq($a)->attr("href");
        if (strlen($href) > 0) {
            $title = pq($a)->html();
            // Record format: href;;title;;page index;;position on page
            file_put_contents("article_list.txt", "{$href};;{$title};;{$n};;{$m}\n", FILE_APPEND);
        }
    }
}

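How the code works:

The titles are a elements under the listTxt style class, so the phpQuery selector .listTxt a matches them.
The selector matches multiple a elements, so foreach iterates over them.
$a is a DOM object, so it has to be wrapped as pq($a) before phpQuery methods can be used on it.
The attr method reads an attribute.
The html method reads the content of the tag. This gives the title and the link, which are saved to article_list.txt for later use.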
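To try the same extraction without downloading phpQuery, the selector can be approximated with PHP's built-in DOMDocument and DOMXPath. This is a swapped-in technique, not what the script above uses, and the inline HTML is an assumed sample rather than the real page markup:

```php
<?php
// Assumed sample markup; the real list pages may be structured differently.
$html = '<ul class="listTxt">'
      . '<li><a href="http://example.com/a.htm">First title</a></li>'
      . '<li><a href="http://example.com/b.htm">Second title</a></li>'
      . '<li><a>No link</a></li>'
      . '</ul>';

libxml_use_internal_errors(true); // keep HTML parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

// XPath equivalent of the CSS selector ".listTxt a".
$xpath = new DOMXPath($doc);
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " listTxt ")]//a';

$n = 0; // page index, fixed here because only one sample page is parsed
$records = [];
foreach ($xpath->query($query) as $m => $a) {
    $href = $a->getAttribute('href'); // empty string when the attribute is missing
    if (strlen($href) > 0) {
        $title = trim($a->textContent);
        $records[] = "{$href};;{$title};;{$n};;{$m}";
    }
}
print_r($records);
```

The third link has no href, so it is dropped by the same strlen check the phpQuery version uses.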

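Then, analyze the article body

Take one of the URLs from article_list.txt and inspect the structure of the page. The body text always sits under the .article style class, so the selector .article retrieves it directly.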

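// Continues the script above, so phpQuery is already loaded.
// The articles directory must exist before this runs.
$content = file_get_contents("article_list.txt");
$arr = explode("\n", $content);
foreach ($arr as $one) {
    if (strlen(trim($one)) == 0) continue;
    // $btn is an optional fifth field; pad so it is always defined.
    list($href, $title, $n, $m, $btn) = array_pad(explode(";;", $one), 5, "");
    phpQuery::newDocumentFile($href);
    $file = "articles/{$n}-{$m}-{$title}.txt";
    $html = trim(pq(".article")->text());
    file_put_contents($file, $html);
}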

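How the code works:

$n is the page number and $m is the position of the article on that page; both are there for debugging.
The crawled content goes into the local articles directory, with each file named by page, position, and title to make debugging easier.
That completes the basic process.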
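The naming scheme can be sketched as a small helper. The function name article_path is hypothetical, and replacing characters that are awkward in file names is an assumption added for illustration; the original script uses the title as is:

```php
<?php
// Hypothetical helper: build "articles/{n}-{m}-{title}.txt".
// Assumption: tags are stripped and path-hostile characters become "_".
function article_path(string $dir, int $n, int $m, string $title): string
{
    $plain = trim(strip_tags($title));
    $safe = preg_replace('#[\\\\/:*?"<>|\s]+#u', '_', $plain);
    return "{$dir}/{$n}-{$m}-{$safe}.txt";
}

echo article_path('articles', 3, 7, 'A/B: report'), "\n";
```

Since the title was read with the html method, it can contain markup or slashes, which is why the sketch cleans it before using it in a path.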

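Finally, optimize the code

A few problems remain:

Not every article is fully matched by .article. About half of the bad cases turned out, on analysis, to need table[width='674'] or table[width='650'] instead.
Some articles fail to crawl and have to be tracked down and debugged step by step, so a $btn field is added to support skipping articles that cannot be crawled.
Check whether an article has already been crawled successfully, and skip it if so.
The foreach can break at the end, or stop at the first error, to quickly debug the crawling of a single article.
If the crawled content is abnormal, that is, empty after trimming whitespace, do not write it to a file; crawl it again once the problem is fixed.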
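The skip rules above could be isolated in two helpers. This is a sketch under assumptions: the names parse_record and already_saved are hypothetical, and the skip flag is taken to be a fifth field set to 1 by hand in article_list.txt:

```php
<?php
// Hypothetical helper: parse one "href;;title;;n;;m[;;btn]" record.
// Returns null for blank lines.
function parse_record(string $line): ?array
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }
    // Pad to five fields so the optional skip flag is always defined.
    [$href, $title, $n, $m, $btn] = array_pad(explode(';;', $line), 5, '');
    return [
        'href'  => $href,
        'title' => $title,
        'n'     => (int) $n,
        'm'     => (int) $m,
        'skip'  => $btn === '1',
    ];
}

// Hypothetical helper: true when a previous run already saved non-empty content.
function already_saved(string $file): bool
{
    return is_file($file) && filesize($file) > 0;
}
```

To skip a stubborn article, append ;;1 to its line in article_list.txt and run the script again.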
The complete code:

<?php

require('phpQuery/phpQuery/phpQuery.php');

// Pass 1: collect article links from every list page.
$content = file_get_contents("list_page.txt");
$arr = explode("\n", $content);
foreach ($arr as $n => $one) {
    $one = trim($one);
    if (strlen($one) == 0) continue;
    phpQuery::newDocumentFile($one);
    foreach (pq(".listTxt a") as $m => $a) {
        $href = (string) pq($a)->attr("href");
        if (strlen($href) > 0) {
            $title = pq($a)->html();
            // Record format: href;;title;;page index;;position on page
            file_put_contents("article_list.txt", "{$href};;{$title};;{$n};;{$m}\n", FILE_APPEND);
        }
    }
}

// Pass 2: fetch each article and save its text under articles/.
$content = file_get_contents("article_list.txt");
$arr = explode("\n", $content);
foreach ($arr as $one) {
    $one = trim($one);
    if (strlen($one) == 0) continue;
    //echo $one . "\n";
    // $btn is an optional fifth field; pad so it is always defined.
    list($href, $title, $n, $m, $btn) = array_pad(explode(";;", $one), 5, "");
    if ($btn == 1) {
        // Manually flagged as uncrawlable: skip.
        continue;
    }

    $file = "articles/{$n}-{$m}-{$title}.txt";
    if (is_file($file) && strlen(file_get_contents($file)) > 0) {
        // Already crawled successfully: skip.
        //echo "exists.{$href}\n";
        continue;
    }

    phpQuery::newDocumentFile($href);
    // Try each known body selector in turn until one yields text.
    $html = trim(pq(".article")->text());
    if (strlen($html) == 0) {
        $html = trim(pq("#UCAP-CONTENT")->text());
    }
    if (strlen($html) == 0) {
        $html = trim(pq("table[width='674']")->text());
    }
    if (strlen($html) == 0) {
        $html = trim(pq("table[width='650']")->text());
    }
    if (strlen($html) > 0) {
        file_put_contents($file, $html);
    } else {
        // Stop at the first failure so the article can be debugged.
        echo "get failed.{$href}\n";
        exit;
    }
}
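The chain of if blocks that tries one selector after another could also be written as a loop. A minimal sketch, assuming a hypothetical helper first_non_empty that receives the extraction step as a callable, so it runs here without phpQuery; the $fake array stands in for real page content:

```php
<?php
// Hypothetical helper: return the first non-empty trimmed text produced
// by $extract for any selector, or '' when every selector comes up empty.
function first_non_empty(array $selectors, callable $extract): string
{
    foreach ($selectors as $selector) {
        $text = trim((string) $extract($selector));
        if ($text !== '') {
            return $text;
        }
    }
    return '';
}

// Same selectors, in the same order, as the full script.
$selectors = ['.article', '#UCAP-CONTENT', "table[width='674']", "table[width='650']"];

// Stand-in extractor for demonstration. With phpQuery loaded, the callable
// would instead be: fn($s) => pq($s)->text()
$fake = ['.article' => "  \n", "table[width='674']" => ' body text '];
$text = first_non_empty($selectors, fn($s) => $fake[$s] ?? '');

echo $text, "\n";
```

Adding a selector for a newly found bad case then means adding one array entry rather than another if block.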

Final result:

[Screenshot: the resulting local files]
