Working with the BeautifulSoup library

Parsers

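Parser: Python standard library
  • Usage: BeautifulSoup(markup, "html.parser")
  • Advantages: built into Python; moderate speed; tolerant of malformed documents
  • Disadvantages: poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2

Parser: lxml HTML parser
  • Usage: BeautifulSoup(markup, "lxml")
  • Advantages: fast; tolerant of malformed documents
  • Disadvantages: requires an external C library

Parser: lxml XML parser
  • Usage: BeautifulSoup(markup, "xml")
  • Advantages: fast; the only supported XML parser
  • Disadvantages: requires an external C library

Parser: html5lib
  • Usage: BeautifulSoup(markup, "html5lib")
  • Advantages: the best tolerance of malformed documents; parses pages the way a browser does; produces valid HTML5
  • Disadvantages: slow; requires an external Python package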
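
Since lxml is an optional dependency, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea; the helper name make_soup and its preferred parameter are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Both parsers close the unclosed <b> tag, so the text is reachable either way.
soup = make_soup("<p>Hello<b>world")
print(soup.b.get_text())
```

Note that different parsers can build different trees from the same broken markup (lxml, for example, wraps fragments in html and body tags), so it is safer to pin one parser when results must be reproducible.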

Basic usage

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; 
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

Tag selectors

If only one tag matches, that tag is returned; if several match, only the first one is returned.

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.title))
print(soup.title)
print(soup.head)
print(soup.p)
<class 'bs4.element.Tag'>
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
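
To make the "first match only" behavior concrete, here is a small sketch on a shortened two-paragraph document (the markup is invented for this example, and html.parser is used so that lxml is not required). Attribute access such as soup.p gives the same element that find("p") returns.

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first <p> in the document.
first_p = soup.p
print(first_p["class"])           # ['title']

# It is the very same element that find() returns.
print(first_p is soup.find("p"))  # True

# find_all() is needed to reach the remaining matches.
print(len(soup.find_all("p")))    # 2
```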

Getting the tag name

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
title

Getting attributes

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
dromouse
dromouse

Getting content

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
The Dormouse's story
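
One caveat worth illustrating: .string only works when a tag contains a single string (directly, or through a chain of single children). For a tag with mixed children it is None, and get_text() is the usual alternative. A short sketch on simplified markup, using html.parser:

```python
from bs4 import BeautifulSoup

html = (
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    "<p class=\"story\">Once upon a time <a>Elsie</a> lived.</p>"
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# <p> has one child <b>, which has one string, so .string resolves to it.
print(title_p.string)      # The Dormouse's story

# This <p> has three children (text, <a>, text), so .string is None.
print(story_p.string)      # None

# get_text() concatenates every string below the tag.
print(story_p.get_text())  # Once upon a time Elsie lived.
```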

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.html.head.title.string)
The Dormouse's story

Children and descendants

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
['\n        Once upon a time there were three little sisters;and their names were\n        ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n        and \n        ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; \n        and they lived at the bottom of a well.\n    ']
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i, child)
<list_iterator object at 0x0000021D39E11BC8>
0 
        Once upon a time there were three little sisters;and their names were
        
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
        and 
        
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ; 
        and they lived at the bottom of a well.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i, child)
<generator object Tag.descendants at 0x0000021D3AA4D248>
0 
        Once upon a time there were three little sisters;and their names were
        
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
        and 
        
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 ; 
        and they lived at the bottom of a well.
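
As the output above shows, .children and .descendants also yield whitespace-only text nodes. When only the tags are of interest, one possible approach is to filter by type; the sketch below assumes a trimmed version of the sample markup and uses html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Keep only direct children that are tags, dropping text nodes.
child_tags = [c for c in soup.p.children if isinstance(c, Tag)]
print([t["id"] for t in child_tags])   # ['link1', 'link2']

# The same filter over descendants also reaches the nested <span>.
descendant_names = [d.name for d in soup.p.descendants if isinstance(d, Tag)]
print(descendant_names)                # ['a', 'span', 'a']

# find_all(True, recursive=False) is an equivalent way to list child tags.
direct = soup.p.find_all(True, recursive=False)
print([t["id"] for t in direct])       # ['link1', 'link2']
```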

Parent and ancestors

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parents)
for i,parent in enumerate(soup.a.parents):
    print(i, parent)
<generator object PageElement.parents at 0x0000021D3AAD2D48>
0 <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
1 <body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
</body></html>
3 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        and 
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
</body></html>

Siblings

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body> 
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and 
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
        and they lived at the bottom of a well.
    </p> 
    <p class="story">...</p> 
"""  

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n        and \n        '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '; \n        and they lived at the bottom of a well.\n    ')]
[(0, '\n        Once upon a time there were three little sisters;and their names were\n        ')]
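
The sibling lists above mix tags with text nodes, which is easy to trip over: the immediate next sibling of a link is often punctuation or a newline rather than the next link. A small sketch of how one might handle that, on simplified markup and with html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is a text node, not the next <a>.
print(repr(first.next_sibling))

# Filtering by type keeps only the sibling tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])  # ['link2', 'link3']

# The first child of <p> has nothing before it.
print(first.previous_sibling)           # None
```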

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

name
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('-'*10)
print(type(soup.find_all('ul')[0]))
print('-'*10)
for i,ul in enumerate(soup.find_all('ul')):
    print(i, ul)
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
----------
<class 'bs4.element.Tag'>
----------
0 <ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
1 <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
attrs
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
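
Because class is a multi-valued attribute, class_ matches a tag if any one of its classes equals the given value. The following sketch, on a reduced version of the sample markup and with html.parser, shows that behavior along with combining several attribute filters.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "list" is one of the classes on both <ul> tags, so both match.
both = [ul["id"] for ul in soup.find_all(class_="list")]
print(both)        # ['list-1', 'list-2']

# Only the second <ul> carries the "list-small" class.
small = [ul["id"] for ul in soup.find_all(class_="list-small")]
print(small)       # ['list-2']

# The class attribute itself is exposed as a list of class names.
print(soup.find(id="list-2")["class"])  # ['list', 'list-small']

# Several conditions in attrs must all hold.
combined = soup.find_all("ul", attrs={"class": "list", "id": "list-2"})
print([ul["id"] for ul in combined])    # ['list-2']
```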
text
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
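
In newer releases of Beautiful Soup the text argument is also available under the name string, and the filter accepts a compiled regular expression as well as a literal. The sketch below assumes bs4 4.4.0 or later (where string was introduced) and uses html.parser on a shortened list.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact match on the text node; the result holds strings, not tags.
exact = soup.find_all(string="Foo")
print(exact)       # ['Foo']

# A regular expression matches any text node it can find a hit in.
by_regex = soup.find_all(string=re.compile(r"^[BJ]"))
print(by_regex)    # ['Bar', 'Jay']

# Combined with a tag name, the tags whose .string matches are returned.
tags = soup.find_all("li", string="Bar")
print(tags)        # [<li class="element">Bar</li>]
```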

find(name, attrs, recursive, text, **kwargs)

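Searches the document by tag name, attributes, or text content.
find returns a single element (the first match, or None if nothing matches); find_all returns all matching elements.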

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.find('ul')))
print(soup.find('ul'))
print(soup.find('page'))
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
None
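
Since find() returns None for a missing tag, chaining straight onto its result raises AttributeError. One way to guard against that is a small wrapper; first_text below is a hypothetical helper written for this example, not part of bs4.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=""):
    """Return the stripped text of the first <name> tag, or default if absent.

    first_text is a hypothetical helper, not a bs4 function.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup("<div><h4> Hello </h4></div>", "html.parser")
print(first_text(soup, "h4"))                   # Hello
print(first_text(soup, "page", default="n/a"))  # n/a
```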

find_parents() find_parent()

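find_parents(name, attrs, text, **kwargs) returns all matching ancestor nodes; find_parent() returns only the closest matching one (the direct parent when no filter is given). These methods search upward, so they take no recursive argument.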

find_next_siblings(name, attrs, text, **kwargs) find_next_sibling()

find_next_siblings() returns all matching siblings that follow the node; find_next_sibling() returns the first one.

find_previous_siblings(name, attrs, text, **kwargs) find_previous_sibling()

find_previous_siblings() returns all matching siblings that precede the node; find_previous_sibling() returns the closest one.

find_all_next() find_next()

find_all_next() returns all matching nodes that come after the node in the document; find_next() returns the first one.

find_all_previous() find_previous()

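find_all_previous() returns all matching nodes that come before the node in the document; find_previous() returns the closest one.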
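
The relative search methods listed above come without an example, so here is a rough sketch of how they behave. It starts from the first <li>Bar</li> of a condensed version of the panel markup (whitespace between the li tags is removed on purpose) and uses html.parser.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul class="list list-small" id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # the <li>Bar</li> inside list-1

# Upward: closest matching ancestor, then all matching ancestors.
print(bar.find_parent("ul")["id"])                           # list-1
print([p.name for p in bar.find_parents(["ul", "div"])])     # ['ul', 'div']

# Sideways: siblings after and before the node.
print(bar.find_next_sibling("li").string)                    # Jay
print(bar.find_previous_sibling("li").string)                # Foo
print([li.string for li in bar.find_next_siblings("li")])    # ['Jay']

# Document order: everything after or before the node, not just siblings.
print(bar.find_next("ul")["id"])                             # list-2
print(len(bar.find_all_next("li")))                          # 3
print(len(bar.find_all_previous("li")))                      # 1
```

The last three calls cross parent boundaries: find_all_next("li") counts Jay in list-1 plus both items of list-2.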

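CSS selectors

Pass a CSS selector string directly to select() to make a selection.

  • . prefixes a class name
  • # prefixes an id
  • a tag is selected by writing its name directly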
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print('-'*10)
print(soup.select('#list-1 .element'))
print('-'*10)
print(soup.select('.panel-body #list-2 li'))
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
----------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
----------
[<li class="element">Foo</li>, <li class="element">Bar</li>]
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
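
select() always returns a list. When a single element is wanted, select_one() returns the first match or None. The sketch below also shows a compound class selector and the child combinator; it assumes a bs4 version that ships select_one and uses html.parser on condensed markup.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching element instead of a list.
first_small = soup.select_one("#list-2 li")
print(first_small.get_text())                  # Foo

# "ul.list.list-small" requires both classes; ">" restricts to direct children.
small_items = soup.select("ul.list.list-small > li")
print([li.get_text() for li in small_items])   # ['Foo', 'Bar']

# No match gives None, mirroring find().
print(soup.select_one("table"))                # None
```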

Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
list-1
list-1
list-2
list-2
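
Indexing a tag with a missing attribute raises KeyError, so for optional attributes the dictionary-style get() is often more convenient. A brief sketch on a single ul element, parsed with html.parser:

```python
from bs4 import BeautifulSoup

html = '<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.select_one("ul")

# get() returns None (or a supplied default) when the attribute is absent.
print(ul.get("id"))               # list-2
print(ul.get("name"))             # None
print(ul.get("name", "unnamed"))  # unnamed

# Multi-valued attributes such as class come back as a list.
print(ul["class"])                # ['list', 'list-small']

# has_attr() checks for presence without reading the value.
print(ul.has_attr("id"))          # True
```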

Getting content

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
Foo
Bar
Jay
Foo
Bar
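
When get_text() is called on a container rather than on each li, the whitespace between tags ends up in the result. Its separator and strip arguments, or the stripped_strings generator, can tidy that up. A minimal sketch with invented markup and html.parser:

```python
from bs4 import BeautifulSoup

html = "<ul>\n  <li> Foo </li>\n  <li>Bar</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# The raw text keeps the newlines and indentation between the tags.
print(repr(soup.ul.get_text()))

# strip=True trims each string and drops empty ones; " " joins what is left.
joined = soup.ul.get_text(" ", strip=True)
print(joined)                           # Foo Bar

# stripped_strings yields the same cleaned pieces one at a time.
pieces = list(soup.ul.stripped_strings)
print(pieces)                           # ['Foo', 'Bar']
```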