Python's Beautiful Soup library

Beautiful Soup 4 is packaged as bs4, so the import is

from bs4 import BeautifulSoup

Create a BeautifulSoup object

soup = BeautifulSoup(html, 'lxml')

We can also create the object from a local HTML file, for example

with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

Pretty-printed output:

print(soup.prettify())

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

The document above is the sample; every search shown below runs against html_doc

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

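BeautifulSoup filters

These filters can be applied to a tag's name, to a node's attributes, to strings, or to any combination of these.

1. String
Pass a string to find content that matches the string exactly

2. Regular expression
Pass a compiled pattern, re.compile(pattern), to match content against the regular expression

3. List
Pass a list such as ['div','a','b'] to return content that matches any item in the list.

4. True
Matches any value

5. Function
You can also pass a custom function that takes a tag and returns True when the tag matches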

The find_all method

Searches all descendants of the current tag and checks whether each one matches the filter.
Returns a list of all matching tags.

Syntax:

find_all( name , attrs , recursive , string , limit , **kwargs )

The name parameter

Finds all tags whose name is name
The name value can be any type of filter: a string, a regular expression, a list, a function, or True.

a. Pass a string as name

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

b. Pass a regular expression

# Find all tags whose names start with "b"; both <body> and <b> should match
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Find all tags whose names contain "t":
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

c. Pass a list

# Find all <a> and <b> tags in the document:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

d. Pass True
True matches any value; the code below finds every tag

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
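
The filter list above also mentions functions, which the name examples do not cover. The following is a minimal sketch of that case, assuming a trimmed copy of the sample document and the built-in 'html.parser' (chosen here only so the snippet runs without lxml). The helper name has_class_but_no_id is illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document, used only for this sketch
sample = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""


def has_class_but_no_id(tag):
    """Return True for tags that define a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# Assumption: 'html.parser' instead of 'lxml', to avoid the extra dependency
soup = BeautifulSoup(sample, "html.parser")

# find_all() calls the function once per tag and keeps the tags
# for which it returns True
matched = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(matched)
```

Both <p> tags have a class and no id, so they are kept; <b> has no class and the <a> tags have an id, so they are skipped.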

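Keyword arguments

If a named argument is not one of the built-in search parameter names, it is treated as a filter on the tag attribute with that name
Accepted filter types: string, regular expression, list, True.

1. Pass a string

# Find all tags whose id attribute is link2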
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

2. Pass a regular expression

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

3. True

# Find all tags that have an id attribute, whatever its value:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Match several attributes at once

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

To search by the class attribute, use class_ instead, because class is a reserved word in Python

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

HTML5 data-* attributes cannot be used as keyword arguments in a search

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

Tags with such special attributes can be found by passing a dictionary through the attrs parameter of find_all()

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

The string parameter

Matches string content in the document and returns a list of strings
The string parameter accepts a string, a regular expression, a list, True, or a function

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
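
The string parameter is also said to accept a function, which the examples above do not show. A small sketch under stated assumptions: the markup is a made-up fragment, 'html.parser' is used so the snippet has no lxml dependency, and is_single_name is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
sample = '<p>Once upon a time <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(sample, "html.parser")


def is_single_name(text):
    """Match text nodes that consist of one capitalized word."""
    stripped = text.strip()
    return stripped.isalpha() and stripped.istitle()


# With string=<function>, the function is called on each text node
# and the matching strings are returned
names = [str(s) for s in soup.find_all(string=is_single_name)]
print(names)
```

The text "Once upon a time " contains spaces and " and " is not capitalized, so only the two names pass the test.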

It can be combined with other parameters

# Find <a> tags whose string is exactly "Elsie":
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

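The limit parameter

Sets an upper bound on matches: once the number of results reaches limit, the search stops and returns

# Three tags in the document tree match, but only 2 are returned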
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive parameter

By default, find_all() searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False

A simple document:

<html>
<head>
  <title>
   The Dormouse's story
  </title>
</head>
...

Search results with and without recursive=False:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
 
soup.html.find_all("title", recursive=False)
# []

The <title> tag is inside the <html> tag but is not its direct child; <head> is the direct child. A search over all descendants finds <title>, but with recursive=False only direct children are searched, so <title> is not found.

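find_all() is probably the most used search method in Beautiful Soup, so the library provides a shorthand for it: calling a BeautifulSoup object or a tag directly.

The following two lines are equivalent: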

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)

The find() method

find( name , attrs , recursive , string , **kwargs )

Differences between find() and find_all():

find_all() returns every matching tag; find() returns only the first match

find_all() returns a list, while find() returns the result directly.

When nothing matches, find_all() returns an empty list and find() returns None.

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
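
Because find() returns None when nothing matches, chaining an attribute access onto its result can raise AttributeError. The sketch below shows one way to guard against that; the one-line document and the variable names are illustrative, and 'html.parser' is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(doc, "html.parser")

title_tag = soup.find("title")
missing_tag = soup.find("table")  # this document has no <table>

# Check for None before reading from the result of find()
title_text = title_tag.get_text() if title_tag is not None else ""
missing_text = missing_tag.get_text() if missing_tag is not None else ""

# find_all() never returns None; an unmatched search gives an empty list
missing_list = soup.find_all("table")

print(repr(title_text), repr(missing_text), missing_list)
```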

Tag attributes: attrs

Prints all attributes of the tag; the result is a dictionary

print(soup.p.attrs)
# {'class': ['title']}

Get the value of a single attribute

print(soup.p['class'])
# ['title']

print(soup.p.get('class'))
# ['title']
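
The two access styles behave differently when the attribute is absent. A short sketch, assuming a made-up <p> tag and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Made-up tag for illustration
soup = BeautifulSoup('<p class="title body" id="intro">text</p>', "html.parser")
paragraph = soup.p

all_attrs = paragraph.attrs           # dict of every attribute
class_value = paragraph["class"]      # class is multi-valued, so this is a list
id_value = paragraph["id"]            # ordinary attributes are plain strings

# .get() returns None (or a supplied default) for a missing attribute
href_or_none = paragraph.get("href")
href_or_default = paragraph.get("href", "#")

# Square brackets raise KeyError for a missing attribute
try:
    paragraph["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(all_attrs, href_or_none, href_or_default, raised_key_error)
```

Using .get() is usually the safer choice when scraping pages where an attribute may be missing on some tags.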

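The select() method

Filters elements using CSS selector syntax and returns a list of tags

CSS selector syntax:
A tag name is written as is; a class name (the value inside the quotes of class="className") is prefixed with a dot; an id (the value inside the quotes of id="idName") is prefixed with #

Find by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

print(soup.select('a'))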
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find by attribute

An attribute condition can be added to the selector; it goes inside square brackets

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

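Conditions that apply to the same tag are written without a space between them;

conditions that apply to different tags are separated by a space.

(Tags that satisfy both condition 1 and condition 2)
Select the tag whose name is a and whose id is link2:

soup.select('a#link2')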
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Descendant search between tags
Find the tag with id link1 inside a p tag; the two parts must be separated by a space

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
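
The effect of the space is easy to miss, so here is a small side-by-side sketch. It assumes a cut-down fragment of the sample document and the built-in 'html.parser'; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Cut-down fragment of the sample document
sample = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(sample, "html.parser")

# No space: both conditions apply to the same tag (an <a> whose id is link2)
same_tag = [tag.get_text() for tag in soup.select("a#link2")]

# Space: the second condition applies to a descendant of the first tag
descendant = [tag.get_text() for tag in soup.select("p #link1")]

# No space again: this asks for a <p> whose own id is link1, which does not exist
no_match = soup.select("p#link1")

print(same_tag, descendant, no_match)
```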

Find the direct children of a tag

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Query with several CSS selectors at once (tags that match condition 1 or condition 2):

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

When you extract content from a page's HTML source with Beautiful Soup and want to delete or replace the &nbsp; entities in it

match on \xa0

>>> soup = BeautifulSoup('<div>a&nbsp;b</div>', 'lxml')
>>> soup.prettify()
'<html>\n <body>\n  <div>\n   a\xa0b\n  </div>\n </body>\n</html>'
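
Since the parser turns each &nbsp; into the character \xa0, an ordinary string replacement is enough to clean the extracted text. A minimal sketch, assuming the built-in 'html.parser' rather than lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>a&nbsp;b</div>", "html.parser")

# The entity has already been decoded to a non-breaking space
raw = soup.div.get_text()

# Replace it with a regular space (use "" instead to delete it)
cleaned = raw.replace("\xa0", " ")

print(repr(raw), repr(cleaned))
```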

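.text and .string

After locating a tag with find(), you can get the text inside it through either the .text attribute or the .string attribute.

The two often return the same result, but they are not the same thing

For example, given HTML like this:

1. <td>some text</td>
2. <td></td>
3. <td><p>more text</p></td>
4. <td>even <p>more text</p></td>

Results from the .string attribute

1. some text
2. None
3. more text
4. None

Results from the .text attribute

1. some text
2. (empty string)
3. more text
4. even more text

Differences between .text and .string:

Row 1: td has no child tags and contains text. Both return the same text.

Row 2: td has no child tags and no text. .string returns None and .text returns an empty string.

Row 3: td has exactly one child tag and the text appears only inside that child. Both return the text inside the child.

Row 4, the key difference: td has a child tag, and both the parent td and the child p contain text. The results differ substantially:

.string returns None, because there are two or more text nodes and .string cannot tell which one to return

.text returns the two pieces of text joined together.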
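
The four cases can be checked directly. This sketch wraps the four cells in a table row (an assumption, to keep the markup valid) and parses with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

cells_html = (
    "<table><tr>"
    "<td>some text</td>"
    "<td></td>"
    "<td><p>more text</p></td>"
    "<td>even <p>more text</p></td>"
    "</tr></table>"
)
soup = BeautifulSoup(cells_html, "html.parser")

results = []
for td in soup.find_all("td"):
    # .string is None unless the tag has exactly one child to follow
    string_value = None if td.string is None else str(td.string)
    # .text concatenates every text node under the tag
    results.append((string_value, td.text))

for number, (string_value, text_value) in enumerate(results, start=1):
    print(number, repr(string_value), repr(text_value))
```

When a tag may contain mixed text and child tags, .text (or get_text()) is the more predictable choice, and .string is best reserved for tags known to hold a single piece of text.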

Demo: extracting web page content with BeautifulSoup

# python3
# -*- coding: utf-8 -*-
# Filename: BeautifulSoup_demo.py
"""
Practice extracting web page content with BeautifulSoup

@author: v1coder
"""

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent':''}


# Scrape the novel 《一念永恒》 from biqukan.com
def biqukan_com():
    url = 'https://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url)
#    text_data = req.text  # also works
    content_data = req.content
    soup = BeautifulSoup(content_data, 'lxml')
#    print(soup.prettify())
    texts = soup.find_all('div', id="content")  
#    texts = soup.find_all('div', class_="showtxt")  # also works
    print(texts[0].text.replace('\xa0'*8,'\n'))
#biqukan_com()

# Scrape the episode titles of 吐槽大会 season 3
def tucaodahui_title():
    url = 'https://v.qq.com/detail/8/80844.html'
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'lxml')
    titles = soup.find_all('strong', itemprop="episodeNumber")
    for title in titles:
        print(title.text)
#tucaodahui_title()

# Scrape a plain-text Jianshu article
def jianshu():
    url = 'https://www.greatytc.com/p/713415f82576'
    data = requests.get(url, headers=headers).text
    soup = BeautifulSoup(data, 'lxml')
    texts = soup.find_all('div', class_="show-content-free")
    print(texts[0].text)
#jianshu()

# Douban Movie TOP250
def douban_TOP250():
    url = 'https://movie.douban.com/top250'
    data = requests.get(url, headers=headers).text
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all('span', class_="inq")
    titles = soup.find_all('img', width="100")
    for num in range(len(titles)):
        # titles[num].get('alt'): get() takes an attribute name and returns its value
        title = str(num+1) + '.' + '《' + titles[num].get('alt') + '》'
        comment = '  ：' + comments[num].text
        print(title)
        print(comment)
        print()
#douban_TOP250()

# Latest movies on 电影天堂 (dytt8.net)
def dytt():
    url = 'https://www.dytt8.net/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    text = soup.find('div', class_="co_content8")
    dates = text.find_all('font')  # dates
#   for date in dates[1:]:
#       print(date.text)
    names = text.select("td a")  # movie titles
    num = 1
    for name in names[2::2]:
        print(name.text)
        print(dates[num].text)
        print()
        num += 1
#dytt()

# Meizitu 1: print image links
def mmjpg():
    url = 'http://www.mmjpg.com/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('img', width="220")
    for url_tag in url_tags:
        pic_url = url_tag['src']  # read an attribute value with ['src'] or get('src')
        print(pic_url)
#mmjpg()

# Meizitu 2: print image links
def haopic_me():
    url = 'http://www.haopic.me/tag/meizitu'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('div', class_="post")
    for url_tag in url_tags:
        pic_url = url_tag.find('img')['src']  
        print(pic_url)
#haopic_me()

# Meizitu 3: print image links
def mzitu_com():
    url = 'https://www.mzitu.com/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('img', class_='lazy')
    for url_tag in url_tags:
        pic_url = url_tag.get('data-original')
        print(pic_url)
#mzitu_com()

Acknowledgments:

Beautiful Soup usage | 静觅
 Beautiful Soup 4.4.0 documentation
 Parsing web pages with BeautifulSoup

2018-12-27

Last edited: 2018.12.27 22:09:53
