[Python]BeautifulSoup 4 notes

BS4

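BeautifulSoup is a Python library for extracting data from HTML or XML. It converts a document into a tree structure (comparable to a DOM tree), and every node in that tree is a Python object of one of the following four types: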

BeautifulSoup <class 'bs4.BeautifulSoup'>
Tag <class 'bs4.element.Tag'>
NavigableString <class 'bs4.element.NavigableString'>
Comment <class 'bs4.element.Comment'>

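One way to picture how these four classes relate is in terms of sets (this is not accurate as a description of the class hierarchy):

BeautifulSoup is the universal set (the document is passed in as an argument to create the BeautifulSoup object) and contains the Tag subset
A Tag contains the NavigableString subset
Comment is a special kind of NavigableString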
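
The set picture above describes containment, not inheritance. As a minimal sketch of the difference (written for Python 3, using the built-in 'html.parser' and a made-up one-line document), the snippet below checks the actual class relationships and then lists what a Tag contains:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Inheritance: BeautifulSoup derives from Tag, Comment from NavigableString.
soup_is_tag = issubclass(BeautifulSoup, Tag)
comment_is_string = issubclass(Comment, NavigableString)
# A NavigableString is not a Tag; it is only contained in one.
string_is_tag = issubclass(NavigableString, Tag)

# Containment at runtime, on a tiny hypothetical document.
soup = BeautifulSoup("<p>hello<!-- note --></p>", "html.parser")
children = [type(node).__name__ for node in soup.p.children]

print(soup_is_tag, comment_is_string, string_is_tag)
# True True False
print(children)
# ['NavigableString', 'Comment']
```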

Usage

The first argument to BeautifulSoup is the document; the second argument specifies which parser to use.

from bs4 import BeautifulSoup
import requests, re

url = 'http://m.kdslife.com/club/'
# fetch the whole HTTP response
response = requests.get(url)
# first argument: the HTML document; second argument: select the lxml parser. Returns a BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')
print(soup.name)
# [document]
print(type(soup))
# <class 'bs4.BeautifulSoup'>

Sample code for Tag objects

# BeautifulSoup --> Tag
# get the Tag object (title)
res = soup.title
print(res)
# <title>KDS Life</title>

res = soup.title.name
print(res)
# title

# attributes of a Tag object
res = soup.section
print(type(res))
# <class 'bs4.element.Tag'>

# class is a multi-valued attribute, so it comes back as a list
print(res['class'])
# ['forum-head-hot', 'clearfix']

# all the attributes of the section Tag object, returned as a dict
print(res.attrs)
# {'class': ['forum-head-hot', 'clearfix']}
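
A Tag's attributes behave much like a dict. The sketch below runs on a small hypothetical section element (the data-page attribute is invented for illustration) and shows reading, probing for a missing attribute, and modifying attributes in place:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class values come from the example above.
html = '<section class="forum-head-hot clearfix" data-page="1">hot</section>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.section

classes = tag["class"]        # multi-valued attribute: a list of strings
page = tag["data-page"]       # ordinary attribute: a plain string
missing = tag.get("id")       # .get() returns None instead of raising KeyError
has_id = tag.has_attr("id")   # False at this point

tag["id"] = "hot"             # assign a new attribute like a dict key
del tag["data-page"]          # and remove one the same way

print(tag.attrs)
```

Using tag.get() rather than tag[...] is usually the safer choice when scraping pages whose markup may vary.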

Sample code for NavigableString objects

# a NavigableString object holds the text inside a Tag object
res = soup.title
print(res.string)
# KDS Life
print(type(res.string))
# <class 'bs4.element.NavigableString'>
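
One thing worth knowing about .string: it only returns a value when the tag has a single child. A small sketch on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a link to <i>example.com</i></p>", "html.parser")

inner = soup.i.string            # <i> has exactly one string child
outer = soup.p.string            # <p> has two children, so .string is None
plain = str(inner)               # copy the text out as an ordinary str
parent_name = inner.parent.name  # a NavigableString still knows its parent Tag

print(inner, outer, parent_name)
# example.com None i
```

Converting with str() is useful when the text should outlive the soup, since a NavigableString keeps a reference to the whole parse tree.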

Sample code for Comment objects

# a Comment is a special kind of NavigableString object
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print(type(comment))
# <class 'bs4.element.Comment'>
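
Because a Comment is also a NavigableString, code that collects text nodes may want to tell the two apart. A minimal sketch, assuming a made-up paragraph that mixes visible text with a comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node in the tree and split comments from ordinary text.
comments = [n for n in soup.descendants if isinstance(n, Comment)]
texts = [
    n for n in soup.descendants
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]

print(comments)
# [' hidden note ']
print(texts)
# ['visible text']
```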

BS4 Parser

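If no parser is specified, BeautifulSoup picks one automatically from those installed, in this order of preference: 'lxml' --> 'html5lib' --> 'html.parser'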
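
Relying on the automatic choice means the same script can behave differently on machines with different parsers installed. One possible way to make the choice explicit is sketched below; make_soup is a hypothetical helper, not part of bs4, and it simply tries the parsers in the order listed above:

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("none of the requested parsers is installed")

soup, used = make_soup("<p>hi</p>")
print(used)           # whichever parser was found first
print(soup.p.string)  # hi
```

Returning the parser name alongside the soup makes it easy to log which parser actually ran.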

Common Tag object methods

find_all()

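find_all(name, attrs, recursive, text, **kwargs): no long explanation here, the code speaks for itself. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.)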

# filter; returns a list of matches
# returns [] if nothing matches
title = soup.find_all('title')
print(title)
# [<title>Google</title>]

# a string passed as the second argument is matched against the CSS class
res = soup.find_all('div', 'topAd')
print(res)

# find all the elements whose id is 'topAd'
res = soup.find_all(id='topAd')
print(res)
# [<div id="topAd">...</div>]

# find all the 'img' tags whose 'src' attribute matches the given pattern
res = soup.find_all('img', src=re.compile(r'^http://club-img', re.I))
print(res)
# [<img .../>, ...]
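
The calls above depend on a live page. For a version that can be run offline, here is a minimal sketch on an invented fragment (the example hosts and class names are placeholders); it also shows class_, limit, and find(), which are standard bs4 features not used above:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure queried above.
html = """
<div id="topAd" class="ad banner">ad</div>
<div class="post"><img src="http://club-img.example/a.png"/></div>
<div class="post"><img src="https://other.example/b.png"/></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("div", class_="post")   # keyword form of the class filter
by_id = soup.find_all(id="topAd")
by_regex = soup.find_all("img", src=re.compile(r"^http://club-img", re.I))
first_two = soup.find_all("div", limit=2)        # stop after two matches
missing = soup.find_all("table")                 # no match: empty list
first_div = soup.find("div")                     # first match only, or None

print(len(by_class), len(by_id), len(by_regex), len(first_two), missing)
# 2 1 1 2 []
```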

select()

# CSS selectors
# select the element whose id is 'wrapperto'
res = soup.select('#wrapperto')
print(res)
# [<div class="swiper-wrapper clearfix" id="wrapperto"></div>]

# select the 'img' tags that have a 'src' attribute
res = soup.select('img[src]')
print(res)
# [<img src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>, <img src="http://club-img.kdslife.com/attach/1k0/gs/a/o41gty-1coa.png@0o_1l_600w_90q.src"/>]

# select the 'img' tags whose 'src' attribute equals the given URL
# (quote the value: an unquoted URL is not a valid CSS attribute value)
res = soup.select('img[src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"]')
print(res)
# [<img src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>]
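
The same selectors can be tried offline. The sketch below uses an invented fragment with placeholder hosts, and adds select_one() and the ^= prefix operator, both of which bs4's CSS support provides:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page queried above.
html = """
<div id="wrapperto" class="swiper-wrapper clearfix">
  <img src="http://club-img.example/a.png"/>
  <img src="http://icon.example/user1.png"/>
  <img alt="no source"/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

wrapper = soup.select_one("#wrapperto")        # first match or None
with_src = soup.select("img[src]")             # attribute is present
exact = soup.select('img[src="http://icon.example/user1.png"]')
prefix = soup.select('div.clearfix > img[src^="http://club-img"]')

print(wrapper["class"])
# ['swiper-wrapper', 'clearfix']
print(len(with_src), len(exact), prefix[0]["src"])
# 2 1 http://club-img.example/a.png
```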

Other

# get_text()
markup = '<a href="http://example.com/">\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'lxml')
res = soup.get_text()
print(res)
#  a link to example.com

res = soup.i.get_text()
print(res)
# example.com

# .stripped_strings
res = soup.stripped_strings
print(list(res))
# [u'a link to', u'example.com']  (Python 2 repr; Python 3 prints the strings without the u prefix)
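
get_text() also accepts a separator and a strip flag, which often saves a manual cleanup step. A short sketch on the same markup, using the built-in 'html.parser' so that no wrapper tags are added:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

raw_text = soup.get_text()                # whitespace kept as in the source
joined = soup.get_text(" ", strip=True)   # strip each piece, join with a space
pieces = list(soup.stripped_strings)      # the same pieces, as a list

print(repr(raw_text))
# '\n a link to example.com\n'
print(joined)
# a link to example.com
```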

Finally, here is a simple KDS image spider:

A KDS image spider
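
The spider itself is not listed in this post. As a rough sketch of what the extraction step of such a spider could look like (this is not the author's code; extract_image_urls, the sample markup, and the example hosts are all invented for illustration), the function below pulls absolute image URLs out of a page using find_all():

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_image_urls(html, base_url, pattern=None):
    """Hypothetical helper: absolute, de-duplicated URLs of <img> tags in html."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img", src=True):   # only tags that have a src
        url = urljoin(base_url, img["src"])      # resolve relative paths
        if pattern is not None and not pattern.search(url):
            continue
        if url not in urls:                      # keep first occurrence only
            urls.append(url)
    return urls

sample = """
<img src="http://club-img.example/a.png"/>
<img src="/static/b.png"/>
<img src="http://club-img.example/a.png"/>
<img alt="no src"/>
"""
found = extract_image_urls(sample, "http://m.example/club/")
filtered = extract_image_urls(
    sample, "http://m.example/club/", re.compile(r"^http://club-img", re.I)
)
print(found)
# ['http://club-img.example/a.png', 'http://m.example/static/b.png']
print(filtered)
# ['http://club-img.example/a.png']
```

In a real spider the html argument would come from requests.get(url).text, and each returned URL would then be downloaded and written to disk.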

Note

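BeautifulSoup detects the document's encoding and converts the document to Unicode automatically. The soup.original_encoding attribute holds the encoding it detected.
Input is converted to Unicode; output is encoded as UTF-8.
BeautifulSoup can be combined with XPath expressions, though bs4 does not evaluate XPath itself; that part is handled by a library such as lxml.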
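
The encoding behaviour can be observed directly. The sketch below is one way to check it, assuming a made-up UTF-8 byte string and passing from_encoding so the result does not depend on which detection libraries happen to be installed:

```python
from bs4 import BeautifulSoup

raw = "<p>中文</p>".encode("utf-8")   # bytes, as they would arrive over HTTP
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.string                 # already decoded to Unicode
detected = soup.original_encoding    # the encoding bs4 settled on
as_utf8 = soup.p.encode("utf-8")     # encode on the way out
as_gbk = soup.p.encode("gbk")        # any other codec can be requested

# When the input is already a str there is nothing to detect.
from_str = BeautifulSoup("<p>中文</p>", "html.parser")

print(text, detected, from_str.original_encoding)
```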

Last edited: 2017.12.03 06:18:05
