Python: Scraping Liao Xuefeng's Python Tutorial

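The four steps of a crawler:
1. Where to crawl (where)
2. What to crawl (what)
3. How to crawl (how)
4. How to save what was crawled (save)

I came across an example online that was written with urllib. For practice, I rewrote it with urllib2 and then made the changes that the actual situation required.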

Where to crawl

The Python 2.7 tutorial
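
As a rough sketch of the fetching step: this post's own code targets Python 2 and urllib2, whose functionality lives in urllib.request in Python 3, so the example below is written for Python 3. The names TUTORIAL_URL, build_request and fetch_html are hypothetical, the URL is only a placeholder because the tutorial's address is not given here, and the browser-like User-Agent is an assumption (some sites reject the default one).

```python
import urllib.request

# Hypothetical placeholder: substitute the real address of the tutorial.
TUTORIAL_URL = "https://example.com/python27-tutorial"


def build_request(url):
    """Build a GET request with a browser-like User-Agent (an assumption)."""
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as UTF-8 text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_html(TUTORIAL_URL) would return the page source as a string, which can then be handed to the parsing step.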

What to crawl

The entire Python 2.7 tutorial

How to crawl

Opening the developer tools (F12) in Chrome shows that the article content sits in this node:

<div class="x-wiki-content">

Once this node is found, its content just needs to be written to an HTML file.

content = soup.find("div", {"class": "x-wiki-content"})
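
The lookup above relies on BeautifulSoup. For readers who want to try the same step without installing it, here is a minimal Python 3 sketch that approximates it with the standard library's html.parser. It is not the code used in this post: ContentExtractor and extract_content are hypothetical names, and the depth counting assumes reasonably well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the first <div class="x-wiki-content"> including its markup."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # nesting depth inside the target div (0 = outside)
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or "x-wiki-content" not in classes:
                return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without changing depth.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.parts.append("</%s>" % tag)
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append("&#%s;" % name)


def extract_content(page):
    """Return the target div as an HTML string, or None if it is missing."""
    parser = ContentExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts) if parser.done else None
```

BeautifulSoup copes with broken markup far better, so this sketch is only meant to show what "find the node" amounts to.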

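How to save what was crawled

Saving mostly comes down to taking the extracted content, wrapping it into an HTML file, and writing that file to disk.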

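    # title = soup.title
    # print title
    # Cut the page title out of the raw HTML and drop the site-name suffix.
    title = html.split("<title>")[1]
    title = title.split(" - 廖雪峰的官方网站</title>")[0]
    # "/" cannot appear in a file name, so replace it with a space.
    title = title.decode('utf-8').replace("/", " ")
    print title

    # Wrap the extracted node in a complete HTML document.
    html = str(content)
    html = head + html + "</body></html>"
    # print html
    # Prefix the chapter's position in the list so the files keep their order.
    filename = path + "\\" + "%d" % actual_list.index(li) + title + ".html"
    # print filename
    # The with block closes the file even if the write fails.
    with open(filename, 'w') as output:
        output.write(html)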
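
The saving step can also be sketched as a few small standalone functions. This version is written for Python 3 rather than the Python 2 used above, and the names page_title, safe_filename and save_page, the default head string, and the zero-padded index are all assumptions made for illustration, not part of the original script.

```python
import os
import re

SITE_SUFFIX = " - 廖雪峰的官方网站"


def page_title(page):
    """Return the <title> text without the site suffix, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", page, re.S)
    if match is None:
        return None
    title = match.group(1).strip()
    if title.endswith(SITE_SUFFIX):
        title = title[:-len(SITE_SUFFIX)]
    return title


def safe_filename(index, title):
    """Build an index-prefixed file name without characters that break paths."""
    cleaned = re.sub(r'[\\/:*?"<>|]', " ", title).strip()
    # Zero padding keeps the files in chapter order when sorted by name.
    return "%03d %s.html" % (index, cleaned)


def save_page(directory, index, title, body_html,
              head='<html><head><meta charset="utf-8"></head><body>'):
    """Wrap body_html in a minimal HTML document and write it as UTF-8."""
    filename = os.path.join(directory, safe_filename(index, title))
    with open(filename, "w", encoding="utf-8") as output:
        output.write(head + body_html + "</body></html>")
    return filename
```

Using os.path.join instead of a hard-coded backslash lets the same code run on systems other than Windows, and replacing every reserved character (not only "/") avoids failures on titles that contain ":" or "?".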

Last edited: 2017.12.04 00:53:47
