2018-08-28

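Basic workflow of a web crawler

I. Send an HTTP request (Request)

Use a Python library to send an HTTP request to the target site, then wait for the server to respond.

II. Receive the response (Response)

If the server handles the request successfully, it returns the response content with status code 200.

As shown in the figure below:

[image: sample HTTP response]

HTTP/1.1 indicates that version 1.1 of the HTTP protocol is in use.

200 is the status code, and the OK after it is a short description of that code. Other common status codes include 301 (resource moved permanently), 404 (resource not found), and 500 (internal server error).

Content-Type and Content-Length are both response headers. They give the type and the length of the content, respectively.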
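
The pieces of a response described above can be picked apart with a few lines of string handling. This is only a sketch: the raw text in `RAW_HEAD` is a made-up sample modelled on the fields mentioned in these notes, and `parse_response_head` is a hypothetical helper, not part of any library.

```python
# Hypothetical raw response head, modelled on the fields described above.
RAW_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 48809\r\n"
)


def parse_response_head(raw):
    """Split a raw response head into (version, status, reason, headers)."""
    lines = raw.strip().split("\r\n")
    # Status line: protocol version, numeric status code, short description.
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(status), reason, headers


version, status, reason, headers = parse_response_head(RAW_HEAD)
print(version, status, reason)
print(headers["Content-Type"], headers["Content-Length"])
```

In practice an HTTP library does this parsing for you; the sketch only shows where each field sits.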
III. Parse the result (Extract)

IV. Process the data further

V. Save the data
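
The five steps can be sketched as one small pipeline. Everything here is an assumption made for illustration: the function names, the choice of `<h2>` elements as the target, and the sample HTML are all hypothetical, and `fetch()` is defined but not called so that the sketch runs offline.

```python
import json
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps I and II: send the request and return the decoded body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_titles(html):
    """Step III: pull the text of every <h2> element out of the HTML."""
    return re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S)


def clean(titles):
    """Step IV: strip whitespace and drop empty entries."""
    return [t.strip() for t in titles if t.strip()]


def save(items, path):
    """Step V: store the result as JSON text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


# Offline demonstration with a hypothetical HTML snippet; in real use the
# string would come from fetch(...).
sample_html = "<h2> First </h2><p>text</p><h2>Second</h2><h2>  </h2>"
print(clean(extract_titles(sample_html)))
```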

=====================================================

Python 3 Web Crawler Development in Practice (book)

Crawler basics

I. Web page basics
- 1. The node tree and relationships between nodes
    - In HTML, everything defined by a tag is a node, and together the nodes form an HTML DOM tree. (DOM, the Document Object Model, is a W3C (World Wide Web Consortium) standard that defines how HTML and XML documents are accessed.)
    <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="reStructuredText" contenteditable="true" cid="n37" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">The W3C Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of a document.</pre>
    - The nodes of the tree are arranged in a hierarchy; terms such as parent, child, and sibling are commonly used to describe their relationships.
  2. Selectors
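
The parent, child, and sibling relationships can be explored with the standard library's `xml.dom.minidom`, which implements the DOM interface. This is a minimal sketch: the document is a made-up, well-formed snippet, and real-world HTML usually needs a more forgiving parser.

```python
from xml.dom import minidom

# Hypothetical, well-formed document used only for illustration.
doc = minidom.parseString(
    "<html><body>"
    "<div id='container'><p>first</p><p>second</p></div>"
    "</body></html>"
)

div = doc.getElementsByTagName("div")[0]
first_p = div.firstChild            # first child node of <div>
second_p = first_p.nextSibling      # sibling of the first <p>

print(div.parentNode.tagName)                 # parent of <div>
print([n.tagName for n in div.childNodes])    # children of <div>
print(second_p.firstChild.data)               # text inside the second <p>
```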
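II. Basic principles of crawlers
1. Crawler overview:
  - Fetching pages
    
    Fetch the source code of the page. The source contains much of the page's useful information, so the data you want can be extracted from it.
  - Extracting information
    
    The most general approach is to write regular expressions that pull the information out of the page source.
    
    Libraries such as Beautiful Soup, pyquery, and lxml can extract information from the page source quickly and efficiently, for example node attributes and text values.
  - Saving data
    
    The extracted data is usually saved as JSON or TXT text; it can also be stored in a database or on a remote server.
  - Automation
2. What kinds of data can be crawled
  - The HTML source behind a given URL; JSON strings (most API endpoints use this form because it is easy to transmit and parse); binary data of all kinds (images, video, audio, and so on); and files with any other extension.
3. JavaScript-rendered pages
  - Why rendering matters: when a page is fetched with urllib or requests, the source you get may differ from what the browser shows. For example, the body node may contain only a single node with the id container, followed by a reference to app.js, which is responsible for rendering the whole site.
  - Because the source returned by a basic HTTP library can differ from the page in the browser, you can either analyze the Ajax endpoints behind the page or use tools such as Selenium or Splash to simulate JavaScript rendering.
4. Sessions and cookies
  - Session: on the web, a session object stores the attributes and configuration needed for a particular user's session.
  - Cookies: data that some sites store on the user's local machine to identify the user and track the session.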
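5. How proxies work
  1. A proxy here means a proxy server, which fetches network information on behalf of the user. When a page is accessed through a proxy server, the IP address the web server sees is the proxy's rather than our own machine's, which effectively disguises our IP.
  2. What proxies are used for
    - Getting around access restrictions placed on your own IP
    - Speeding up access
    - Hiding the real IP: users can hide their own IP this way to protect themselves from attacks, and crawlers can use it to keep their own IP from being blocked
  3. Proxies for crawlers
    - Rotating proxies continually during a crawl makes it much less likely that the crawler will be blocked.
  4. Types of proxy
    
    Proxies can be classified by protocol or by level of anonymity.
    1. By protocol
      - FTP proxy: mainly used to access FTP servers; usually supports upload, download, and caching; typical ports are 21 and 2121
      - HTTP proxy: mainly used to access web pages; usually supports content filtering and caching; typical ports are 80, 8080, and 3128
      - SSL/TLS proxy: mainly used to access encrypted sites; usually supports SSL or TLS encryption (up to 128-bit strength); the typical port is 443
      - RTSP proxy: mainly used to access Real streaming media servers; usually supports caching; the typical port is 554
      - Telnet proxy: mainly used for Telnet remote control; the typical port is 23
      - POP3/SMTP proxy: mainly used to send and receive mail over POP3/SMTP; typical ports are 110 and 25
      - SOCKS proxy: simply relays data packets without caring about the protocol or how it is used; the typical port is 1080
    2. By level of anonymity
      - Highly anonymous proxy
      - Ordinary anonymous proxy
      - Transparent proxy
      - Spy proxy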
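
With urllib, a proxy is configured through `ProxyHandler` and an opener. The sketch below assumes a proxy listening at `127.0.0.1:8080`, which is a placeholder address, and `fetch_via_proxy` is a hypothetical helper; it is defined but not called, so nothing is sent over the network.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXY = "http://127.0.0.1:8080"

# Map each URL scheme to the proxy that should carry it.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch a URL through the configured proxy and return the body bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


print(proxy_handler.proxies)
```

Rotating proxies then amounts to building a new opener (or handler) per proxy address.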

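II. Using the basic libraries

The most basic HTTP libraries include urllib, httplib2, requests, and treq.

I. Using urllib

urllib is Python's built-in HTTP request library. It contains the following four modules:
- request: the most basic HTTP request module, used to simulate sending requests
- error: the exception-handling module; if a request fails, the exception can be caught here
- parse: a utility module that provides many URL-handling functions
- robotparser: mainly used to parse a site's robots.txt file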

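Sending requests

urlopen()
- The urllib.request module provides the most basic way to build an HTTP request. It can simulate the way a browser issues a request, and it also handles authentication, redirection, browser cookies, and more.
- urlopen() returns an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(); its main attributes are msg, version, status, reason, and debuglevel.
- Different methods and attributes give different pieces of the result:
  
  Calling read() returns the response body, and the status attribute holds the status code of the response.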
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n164" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import urllib.request
  
  response = urllib.request.urlopen('https://python.org')
  
  print(response.read().decode('utf-8'))
  
  print(response.status)  # response status code
  
  print(type(response))
  
  200
  
  print(response.getheaders())  # response headers
  
  [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48809'), ('Accept-Ranges', 'bytes'), ('Date', 'Sat, 18 Aug 2018 09:45:54 GMT'), ('Via', '1.1 varnish'), ('Age', '1415'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2151-IAD, cache-hnd18742-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 948'), ('X-Timer', 'S1534585555.579001,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
  </pre>
  
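  Call getheader() with the argument Server to get the value of the Server header in the response.
  
  getheaders() takes no arguments and returns all of the response headers.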
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n167" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">>>> import urllib.request
  
  response = urllib.request.urlopen('https://www.python.org')
  print(response)
  <http.client.HTTPResponse object at 0x0000000002FCA898>
  print(response.getheaders())
  [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48807'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 21 Aug 2018 01:20:46 GMT'), ('Via', '1.1 varnish'), ('Age', '1562'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2135-IAD, cache-nrt6136-NRT'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '2, 1133'), ('X-Timer', 'S1534814447.985583,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
  print(response.status)
  200
  print(response.getheaders('server'))
  Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
  print(response.getheaders('server'))
  TypeError: getheaders() takes 1 positional argument but 2 were given
  print(response.getheaders('Server'))
  Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
  print(response.getheaders('Server'))
  TypeError: getheaders() takes 1 positional argument but 2 were given
  print(response.getheader('Server'))
  nginx
  </pre>
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n168" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">The basic urlopen() function is enough to fetch simple pages. Its API is:</pre>
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n169" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None) </pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" contenteditable="true" cid="n171" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> data parameter: optional; it must be of type bytes, and when it is passed the request is sent as POST instead of GET

timeout parameter: how long to wait for a response, in seconds; an exception is raised if no response arrives in time</pre>
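
A short sketch of how the two parameters are typically used together. The form field is a made-up example, and `post_form` is a hypothetical helper that is defined but not called here, so the block runs without network access.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

# data must be bytes: encode a dict as a query string, then as UTF-8.
form = {"name": "germey"}
data = urllib.parse.urlencode(form).encode("utf-8")
print(data)


def post_form(url, payload, timeout=5):
    """POST payload (bytes) to url; return the body, or None on timeout."""
    try:
        with urllib.request.urlopen(url, data=payload, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.URLError as exc:
        # A timeout while connecting surfaces as URLError wrapping socket.timeout.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        return None
```

A call such as `post_form('http://httpbin.org/post', data)` would then send the form as a POST request.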

Request

Use Request to build a complete request:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n177" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">>>> import urllib.request

request = urllib.request.Request('https://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read())</pre>

Building a Request object turns the request into a standalone object whose parameters can be configured flexibly.

The constructor signature of Request is:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n180" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">class urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)</pre>
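- url: the URL to request; this is the only required parameter
- data: must be of type bytes; if the data is a dictionary, encode it first with urlencode() from the urllib.parse module
- headers: a dictionary of request headers; they can be passed through the headers parameter when the request is constructed, or added afterwards by calling add_header() on the request instance
  
  The most common reason to add a request header is to set User-Agent so that the request looks like it comes from a browser.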
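
The parameters above can be combined as follows. This is a sketch only: the User-Agent string and form field are placeholders, and the request is built but not sent, so it runs offline.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
# Placeholder User-Agent value, used only for illustration.
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}
data = urllib.parse.urlencode({"name": "germey"}).encode("utf-8")

# Headers can be passed to the constructor ...
req = urllib.request.Request(url, data=data, headers=headers, method="POST")
# ... or added afterwards with add_header().
req.add_header("Host", "httpbin.org")

print(req.get_method())
print(req.header_items())
# urllib.request.urlopen(req) would send it; omitted to keep the sketch offline.
```

Note that Request normalizes header names with `str.capitalize()`, so the header is stored as `User-agent`.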
Advanced usage
- HTTPRedirectHandler: handles redirects
- HTTPCookieProcessor: handles cookies
- ProxyHandler: sets a proxy; by default no proxy is set
- HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords
- HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, it can take care of that
- HTTPDefaultErrorHandler: handles HTTP response errors; every error is raised as an HTTPError exception
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n204" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">The BaseHandler class in the urllib.request module is the parent class of all other Handlers; each Handler subclass inherits from BaseHandler.
  - Proxies
  - Cookies</pre>
1. Cookies
  
  Handling cookies requires the corresponding Handler.
  
  The http.cookiejar module (cookielib in Python 2): its main classes are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar. A CookieJar object can capture cookies and send them again on later requests, which is useful for things like simulating a login.
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n210" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import http.cookiejar, urllib.request
  cookie = http.cookiejar.CookieJar()
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open('http://www.baidu.com')
  for item in cookie:
      print(item.name + '=' + item.value)</pre>
  
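  First declare a CookieJar object. Next, use HTTPCookieProcessor to build a Handler, and finally use build_opener() to build an opener and call its open() method.
  - Saving cookies to a text file: replace CookieJar with MozillaCookieJar, a subclass of CookieJar that is used when writing files. It handles cookie- and file-related operations such as reading and saving cookies, and it stores them in the Mozilla browser's cookie format.
  - LWPCookieJar can also read and save cookies; it stores them in the libwww-perl (LWP) cookie file format.
  - Reading and using a cookie file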
    
    <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" contenteditable="true" cid="n219" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;"></pre>
    
    The load() method reads a local cookie file and gets the cookies it contains.
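
Saving and loading a cookie file might look like the sketch below. To stay offline it builds one cookie by hand; the cookie's name, value, and domain are invented, and in a real crawl the jar would be filled by `opener.open(...)` instead.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Hand-built session cookie; every field value here is made up.
cookie = http.cookiejar.Cookie(
    version=0, name="session_id", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

# Save in Mozilla format; the flags keep session and expired cookies too.
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(cookie)
jar.save(ignore_discard=True, ignore_expires=True)

# Later: load the file into a fresh jar and attach it to an opener.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(loaded))

for item in loaded:
    print(item.name + "=" + item.value)
```

Requests made through `opener` would then carry the loaded cookies where the domain and path match.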

urlparse()

The parse module in the urllib library defines a standard interface for processing URLs.

Use urlparse() to parse a URL:

from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

Output:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

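The result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.
scheme: the protocol; netloc: the domain name; path: the access path; params: the parameters (user here); query: the query condition; fragment: the anchor after the # sign, used to jump directly to a position inside the page.

API of urlparse():
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

It takes three parameters:
1.urlstring: the URL to parse
2.scheme: the default protocol (such as http or https); if the link carries no protocol, scheme is used as the default
3.allow_fragments: whether fragments are recognized. If it is set to False, the fragment part is not parsed separately; it is treated as part of path, params, or query, and fragment is left empty.

The value returned by urlparse() is a named tuple, so its parts can be read by index as well as by attribute name.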

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment',allow_fragments=True)
print(result)
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
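
The scheme parameter and the tuple behaviour of the result are not shown above, so here is a small sketch of both. The URLs are variations on the one used in these notes.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the scheme argument is used as the default.
# Without a leading "//" the host name is treated as part of the path.
result = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(result)

# A scheme that is present in the URL wins over the default.
explicit = urlparse("http://www.baidu.com/index.html", scheme="https")
print(explicit.scheme)

# ParseResult is a named tuple: index access and attribute access both work.
print(result[0], result.scheme)
print(result[4], result.query)
```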

2.urlunparse()
It takes a single iterable argument whose length must be exactly 6; otherwise it raises an error about too few or too many values.

from urllib.parse import urlunparse
result = ('https','www.baidu.com','index.html','user','a=6','comment')

print(urlunparse(result))
https://www.baidu.com/index.html;user?a=6#comment

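3.urlsplit()
Similar to urlparse(), except that params is not split out separately (it stays in path), so the result has only 5 parts.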

from urllib.parse import urlsplit
result = urlsplit('https://www.baidu.com/index.html;user?a=6#comment')
print(result)
SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='a=6', fragment='comment')

4.urlunsplit()
Combines the parts of a link into a complete URL. The argument is again an iterable, and its length must be exactly 5.

from urllib.parse import urlunsplit
result = ('https','www.baidu.com','index.html;user','a=6','comment')
print(urlunsplit(result))
https://www.baidu.com/index.html;user?a=6#comment

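urlunparse() and urlunsplit() can assemble a link, provided the number of parts is fixed and each part of the link is clearly separated.

5.urljoin()
urljoin() takes base_url (the base link) as its first argument and the new link as its second. It analyzes the scheme, netloc, and path of base_url, fills in whatever the new link is missing, and returns the result.
urljoin() therefore covers parsing, combining, and generating links.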

from urllib.parse import urljoin
print(urljoin('https://baidu.com','https://www.baidu.com/index.html;user?a=6#comment'))
https://www.baidu.com/index.html;user?a=6#comment
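
A few more cases make the filling-in rule easier to see. The base URL and the relative links below are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://www.baidu.com/docs/index.html"

# Relative path: resolved against the directory of the base path.
relative = urljoin(base, "faq.html")
# Absolute path: keeps scheme and netloc, replaces the path.
absolute_path = urljoin(base, "/about.html")
# Scheme-relative link: keeps only the scheme of the base.
scheme_relative = urljoin(base, "//cdn.example.com/app.js")
# Complete URL: the base is ignored entirely.
complete = urljoin(base, "https://www.python.org/")

for url in (relative, absolute_path, scheme_relative, complete):
    print(url)
```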

6.urlencode()
To build parameters more conveniently, first express them as a dictionary, then use urlencode() to serialize it into GET request parameters.

from urllib.parse import urlencode
params = { 'name':'germey','age':22}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
https://www.baidu.com?name=germey&age=22

7.parse_qs()
Deserialization: converts GET request parameters back into a dictionary. Each value is kept as a string and placed in a list.

from urllib.parse import parse_qs
query = 'name=germey&age=22'
print(parse_qs(query))
{'name': ['germey'], 'age': ['22']}

8.parse_qsl()
Converts GET request parameters into a list of tuples.

from urllib.parse import parse_qsl
print(parse_qsl(query))
[('name', 'germey'), ('age', '22')]
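
The three functions form a round trip, which the following sketch shows with a non-ASCII value. The parameter names `wd` and `page` are arbitrary examples.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

params = {"wd": "壁纸", "page": 2}

# Serialize: non-ASCII text is percent-encoded as UTF-8, numbers become text.
query = urlencode(params)
print(query)

# Deserialize: parse_qs gives lists of strings, parse_qsl gives pairs.
as_dict = parse_qs(query)
as_pairs = parse_qsl(query)
print(as_dict)
print(dict(as_pairs))
```

Note that the number 2 comes back as the string '2'; the round trip does not restore the original types.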

9.quote()
Converts content into URL-encoded form. It can encode Chinese characters, which avoids garbled text in URLs.

from urllib.parse import quote
key = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(key)
print(url)
https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

10.unquote()
Decodes a URL-encoded string.

from urllib.parse import unquote
print(unquote('https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'))
https://www.baidu.com/s?wd=壁纸

Commonly used URL functions: unquote (decode), quote (encode), parse_qsl (convert GET parameters to a list of tuples), parse_qs (convert GET parameters to a dictionary), urljoin (merge base_url with a second link), urlencode (build parameters), urlunsplit (combine exactly 5 parts), urlsplit (parse a URL into 5 parts), urlunparse (combine exactly 6 parts), and urlparse (parse a URL into 6 parts).

Analyzing the Robots protocol
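
The robotparser module listed earlier can answer whether a URL may be fetched. The sketch below feeds it a made-up robots.txt as a list of lines via `parse()`, so no request is made; with a real site you would normally call `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as lines so no request is made.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

allowed = rp.can_fetch("*", "https://example.com/public/index.html")
blocked = rp.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```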

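II. Using requests

1. A first example
  
  urlopen() in the urllib library actually requests a page with the GET method; the corresponding method in requests is get().
2. GET requests
  - Basic example: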
    
    <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n240" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import requests
    r = requests.get('http://httpbin.org/get')
    print(r.text)
    
    {
    "args": {},
    "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
    },
    "origin": "222.209.11.28",
    "url": "http://httpbin.org/get"
    }</pre>
    
    The returned content includes the request headers, the URL, the IP address, and other information.
3. POST requests:
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n244" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import requests
  data = {'name': 'gemery', 'age': '22'}
  r = requests.post('http://httpbin.org/post', data=data)
  
  print(r.text)
  
  {
  "args": {},
  "data": "",
  "files": {},
  "form": {
  "age": "22",
  "name": "gemery"
  },
  "headers": {
  "Accept": "*/*",
  "Accept-Encoding": "gzip, deflate",
  "Connection": "close",
  "Content-Length": "18",
  "Content-Type": "application/x-www-form-urlencoded",
  "Host": "httpbin.org",
  "User-Agent": "python-requests/2.19.1"
  },
  "json": null,
  "origin": "222.209.11.28",
  "url": "http://httpbin.org/post"
  }</pre>
  
  The submitted data appears in the form field of the response, which shows that the POST request succeeded.
4. Responses
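
For GET requests with parameters, requests can build the query string itself, and the response object exposes the status code and parsed JSON. The sketch below prepares the request without sending it so the resulting URL can be inspected offline; `get_json` is a hypothetical helper that is defined but not called.

```python
import requests

# Let requests build the query string instead of concatenating it by hand.
params = {"name": "germey", "age": 22}
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(prepared.url)


def get_json(url, params=None, timeout=10):
    """Send a GET request and return (status_code, parsed JSON body)."""
    r = requests.get(url, params=params, timeout=timeout)
    return r.status_code, r.json()
```

Calling `get_json('http://httpbin.org/get', params)` would send the same request and return the decoded body as a dictionary.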
Advanced usage
1. File upload
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n253" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import requests
  # Open the file in binary mode; favicon.ico must exist in the working directory.
  files = {'file': open('favicon.ico', 'rb')}
  r = requests.post('http://httpbin.org/post', files=files)
  print(r.text)</pre>
2. cookies
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n256" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">r = requests.get('https://www.baidu.com')
  print(r.cookies)
  for key, value in r.cookies.items():
      print(key + '=' + value)</pre>
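3. Session persistence
  - In requests, get() and post() can each simulate a page request, but every call is effectively a separate session, as if each request came from a different browser, so cookies are not shared between calls.
  - Maintaining a session:
    
    A Session object keeps the session alive without your having to manage cookies yourself. It is typically used to carry out further steps after a simulated login succeeds.
4. SSL certificate verification
5. Proxy settings
  - Passing the proxies parameter helps keep your IP address from being banned during large-scale crawling.
  - Besides HTTP proxies, requests also supports SOCKS proxies.
6. Timeout settings
  
  A timeout guards against a server that does not respond in time: if no response arrives within the limit, an exception is raised.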
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n277" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">response = requests.get('https://www.baidu.com',timeout=1)
  
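  # A request actually has two phases: connect and read. A single timeout value is used for both; pass a tuple such as timeout=(5, 30) to set them separately.
  # To wait indefinitely, set timeout to None or leave it out.</pre>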
7. Authentication
  
  Use the authentication support built into requests.
8. Prepared Request
  
  <pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n283" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">from requests import Request,Session
  url='https://www.baidu.com/s?wd=windows%E5%A6%82%E4%BD%95%E8%A7%A3%E5%86%B3%E7%9B%AE%E6%A0%87%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%A7%AF%E6%9E%81%E6%8B%92%E7%BB%9D%E8%AE%BF%E9%97%AE%E7%9A%84%E9%97%AE%E9%A2%98&rsv_spt=1&rsv_iqid=0xb5101a30000105ed&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=%25E5%25A6%2582%25E4%25BD%2595%25E8%25A7%25A3%25E5%2586%25B3%25E7%259B%25AE%25E6%25A0%2587%25E8%25AE%25A1%25E7%25AE%2597%25E6%259C%25BA%25E7%25A7%25AF%25E6%259E%2581%25E6%258B%2592%25E7%25BB%259D%25E8%25AE%25BF%25E9%2597%25AE%25E7%259A%2584%25E9%2597%25AE%25E9%25A2%2598&rsv_t=406dVTwcDN6qA3Ks2jTy3qzp%2BfzOihc1gr51HhBUE6Uzz9Y18rl68giIDEJ2QyqaLJmS&rsv_pq=98edc1a700057ea3&inputT=4290&rsv_sug3=216&rsv_sug4=5434'
  data = {
  'name':'germey'
  }
  headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
  }
  s = Session()
  req = Request('POST', url, data=data, headers=headers)
  prepped = s.prepare_request(req)
  r = s.send(prepped)
  print(r.text)
  </pre>
  
  Import Request, build a Request object from the url, data, and headers arguments, call the Session's prepare_request() method to turn it into a PreparedRequest object, and then call send() to send it.
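
Session persistence and prepared requests fit together: cookies held by a Session are attached when the session prepares a request. The sketch below sets a cookie by hand (the name and value are invented; normally the server sets it via Set-Cookie) and inspects the prepared request without sending it.

```python
import requests

session = requests.Session()
# Made-up cookie, set manually so the sketch stays offline.
session.cookies.set("number", "123456789")

req = requests.Request("GET", "http://httpbin.org/cookies")
# prepare_request() merges the session's cookies and headers into the request.
prepped = session.prepare_request(req)
cookie_header = prepped.headers.get("Cookie")
print(cookie_header)
```

Calling `session.send(prepped)` would then deliver the request with that Cookie header.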

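Regular expressions

A first example

match()

match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails.

The first argument of match() is the regular expression, and the second is the string to match against.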

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n295" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import re
content = 'Hello'
result = re.match(r'\w{5}', content)
print(result)</pre>

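Capturing a target: wrap the substring you want in parentheses; each pair of parentheses forms a group, and calling group() with the group's index returns the captured text.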

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n299" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import re
content = 'Hello 123 456'
result = re.match(r'(\w{5})\s(\d{3})\s(\d{3})', content)
print(result.group(2))</pre>
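Generic matching

.* matches any sequence of characters: . matches any character except a newline, and * repeats the preceding item zero or more times.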

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n303" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">content = 'Hello 123 456 mongodb'
result = re.match('^Hello.*mongodb$', content)
print(result.group())</pre>
Greedy and non-greedy matching

Greedy: match as many characters as possible

Non-greedy: match as few characters as possible

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n308" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;"># Non-greedy:
content = 'Hello 123456 mongodb demo'
result = re.match(r'^Hello.*?(\d+).*demo$', content)
print(result.group(1))</pre>
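
Running the same pattern in both modes makes the difference concrete. The sample string is made up for this comparison.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Greedy: .* swallows as much as it can and leaves (\d+) a single digit.
greedy = re.match(r'^He.*(\d+).*Demo$', content)
print(greedy.group(1))

# Non-greedy: .*? stops as early as possible, so (\d+) gets the whole number.
non_greedy = re.match(r'^He.*?(\d+).*Demo$', content)
print(non_greedy.group(1))
```

The greedy version prints 7 and the non-greedy version prints 1234567, which is why .*? is usually the safer choice in the middle of a pattern.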

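| Pattern | Description |
| --- | --- |
| \w | Matches letters, digits, and underscores |
| \s | Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] |
| \d | Matches any digit |
| \Z | Matches only at the end of the string |
| \G | Matches the position where the previous match ended (not supported by Python's re module) |
| \n | Matches a newline |
| \t | Matches a tab |
| ^ | Matches the start of the string (and of each line with re.M) |
| $ | Matches the end of the string (and of each line with re.M) |
| . | Matches any character except a newline (any character at all with re.S) |
| [ask] | Matches one character from a set: here a, s, or k |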


Modifiers

Flags that adjust matching for special cases

re.I makes matching case-insensitive

re.L performs locale-aware matching

re.M multi-line matching; affects ^ and $

re.S makes . match every character, including newlines

re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, and \B

re.X allows a more flexible pattern layout so that the regular expression can be written more readably

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n366" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">content = 'Hello 123\n456 mongodb'
result = re.match(r'^Hello.*?(\d+\s\d+).*mongodb$', content, re.S)
print(result.group(1))</pre>
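
A few of the flags in action, as a sketch with made-up sample text. Flags can be combined with the | operator.

```python
import re

text = "Hello World\nhello python"

# re.I ignores case; re.M lets ^ match at the start of every line.
starts = re.findall(r"^hello", text, re.I | re.M)
print(starts)

# Without re.S the dot stops at the newline, so the match fails.
no_dotall = re.match(r"Hello.*python", text)
with_dotall = re.match(r"Hello.*python", text, re.S)
print(no_dotall, with_dotall.group() == text)

# re.X: whitespace and comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -(\d{2})  # month
    -(\d{2})  # day
""", re.X)
parts = date_pattern.match("2018-08-28").groups()
print(parts)
```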
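Escaping

When the text to match contains characters that have a special meaning in regular expressions, put a backslash in front of each one to escape it.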

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n370" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">content = '(百度)www.baidu.con'
result = re.match(r'\(百度\)www\.baidu\.con', content)
print(result.group())</pre>
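
When the literal text is long, `re.escape()` can add the backslashes automatically. The sketch below contrasts an unescaped pattern with an escaped one, using a sample string similar to the one above.

```python
import re

content = '(百度)www.baidu.com'

# Unescaped, the parentheses form a group and each dot matches any character,
# so the literal "(" at the start of content is never matched.
unescaped = re.match('(百度)www.baidu.com', content)
print(unescaped)

# re.escape() puts a backslash in front of every special character.
pattern = re.escape(content)
escaped = re.match(pattern, content)
print(escaped.group())
```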

search（）

扫描整个字符串，然后返回第一个成功匹配到的结果。

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n374" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">import re
# Adjacent string literals inside parentheses are joined into one string.
content = (
    '<li data-view="4" class="activte">'
    '<a href="/2.mp3" singer="任贤齐">长海一笑</a>'
    '<li data-view="4" class="activte">'
    '<a href="/2.mp3" singer="王伦">长一笑</a>'
)
# search() returns a match object for the first match only.
result = re.search(r'<li.*?singer="(\w+)">(\w+)</a>', content, re.S)
print(result.group(1), result.group(2))</pre>
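The difference from match() is easiest to see on a string that does not begin with the pattern. A small sketch, using a made-up sample string:

```python
import re

# Hypothetical sample text: the part we want is not at the start of the string.
content = 'Extra text Hello 123 456 mongodb'

# match() only tries at the beginning of the string, so it finds nothing here.
at_start = re.match(r'Hello.*?(\d+)', content)
print(at_start)

# search() scans forward and returns the first position where the pattern fits.
found = re.search(r'Hello.*?(\d+)', content)
print(found.group(1))
```

For this reason search() is usually the more convenient choice when the position of the target text is not known in advance.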
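findall()

Scans the entire string and collects every successful match. The results are returned as a list; when the pattern contains more than one group, each item in the list is a tuple.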

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n378" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;"># Reuses content from the search() example; each item is a (singer, title) tuple.
result = re.findall(r'<li.*?singer="(\w+)">(\w+)</a>', content, re.S)
for item in result:
    print(item)</pre>

To get only the first result, use search(). To get all results, use findall().
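A side-by-side sketch of the two calls on the same input may help. The HTML fragment and its values are placeholders invented for this example.

```python
import re

# Illustrative HTML fragment; the singers and titles are placeholders.
html = (
    '<li><a href="/1.mp3" singer="A">song_one</a></li>'
    '<li><a href="/2.mp3" singer="B">song_two</a></li>'
)
pattern = r'<a href="(.*?)" singer="(\w+)">(\w+)</a>'

first = re.search(pattern, html)    # match object for the first hit only
every = re.findall(pattern, html)   # list of tuples, one tuple per hit

print(first.groups())
print(every)
```

Note that search() gives back a match object whose groups must be read with `group()` or `groups()`, while findall() returns the captured groups directly.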
sub()

The sub() method can be used to modify the text first and then extract the data.

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" contenteditable="true" cid="n383" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 1em 6px; margin-bottom: 15px; margin-top: 15px; width: inherit;">result = re.sub(r'\d+', '', content)
# First argument: the pattern to match
# Second argument: the replacement text
# Third argument: the string to process</pre>
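To make the "modify first, then extract" idea concrete, here is a sketch that strips tags with sub() before running findall(). The HTML fragment is invented for illustration.

```python
import re

# Invented sample: one title is wrapped in <a>, the other is not.
html = '<li><a href="/1.mp3">song_one</a></li><li>song_two</li>'

# Step 1: remove the <a ...> and </a> tags so every item has the same shape.
cleaned = re.sub(r'<a.*?>|</a>', '', html)
print(cleaned)

# Step 2: a much simpler pattern can now extract all titles.
titles = re.findall(r'<li>(.*?)</li>', cleaned)
print(titles)
```

Cleaning the text first keeps the extraction pattern short, at the cost of one extra pass over the string.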
compile()

Compiles a regular expression string into a regular expression object so that it can be reused in later matches.
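A minimal sketch of that reuse, assuming a few hypothetical timestamp strings from which the time-of-day part should be removed:

```python
import re

# Hypothetical timestamps; the goal is to drop the time-of-day part.
samples = ['2018-08-28 12:00', '2018-08-29 13:30', '2018-08-30 09:15']

# Compile once, then reuse the same pattern object for every string.
time_part = re.compile(r'\s\d{2}:\d{2}')

dates = [time_part.sub('', s) for s in samples]
print(dates)
```

Flags such as `re.S` can also be passed to `re.compile()`, so they do not have to be repeated on each later call.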
