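A simple web crawler, written in Python, that finds the image links in a web page and downloads the images. The .py file can be run directly, but only from the command line, which is inconvenient, so python-script-converter or pyinstaller is used to turn it into an executable that runs on double-click.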
The code, img.py:
#encoding:UTF-8
import os
import re
import sys

from bs4 import BeautifulSoup

try:  # Python 3
    from urllib.request import urlopen, urlretrieve
except ImportError:  # Python 2
    from urllib import urlopen, urlretrieve
    input = raw_input


def getImg(url):
    # Download the page and collect every <img> tag in it.
    page = urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    imglist = soup.find_all('img')
    length = len(imglist)
    # Images are saved next to the script (or the packaged executable).
    filePath = os.path.dirname(os.path.realpath(sys.argv[0]))
    print(filePath)
    for index, item in enumerate(imglist, 1):
        try:
            imageUrl = getImageUrl(item)
            print('[{0}-{1}]{2}'.format(index, length, imageUrl))
            if len(imageUrl) > 0:
                # Every file is named <index>.jpg, whatever the real image format is.
                urlretrieve(imageUrl, os.path.join(filePath, '%s.jpg' % index))
        except Exception as e:
            print(e)


def getImageUrl(item):
    # Prefer src, then data-src (lazy loading), then any attribute whose name contains "src".
    imageUrl = ""
    if item.has_attr('src'):
        imageUrl = item.attrs['src']
    elif item.has_attr('data-src'):
        imageUrl = item.attrs['data-src']
    else:
        print(item)
        for name in item.attrs:
            if 'src' in name:
                imageUrl = item.attrs[name]
                break
    return getRealUrl(imageUrl)


def getRealUrl(url):
    # Keep only absolute http(s) addresses; anything else is skipped.
    if re.match(r'https?://', url):
        return url
    return ""


if __name__ == '__main__':
    if len(sys.argv) > 1:
        url = sys.argv[1]
        print("url = " + url)
    else:
        url = input("please input url:")
        print(url)
    # Example: url = "https://mp.weixin.qq.com/s/SBM1gq5i7ZfrE4GMBzK6dw"
    getImg(url)
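getRealUrl only keeps addresses that contain http, so relative paths such as /static/a.png and protocol-relative ones such as //cdn.example.com/b.jpg are skipped. One possible extension, which is not part of the original script, is to resolve them against the page address with the standard library's urljoin. The sketch below assumes that approach; the helper name resolve_image_url and the example.com addresses are made up for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def resolve_image_url(page_url, src):
    """Hypothetical helper: turn a possibly relative src into an absolute URL."""
    if not src:
        return ""
    absolute = urljoin(page_url, src.strip())
    # Keep the script's rule of downloading only http(s) addresses.
    if absolute.startswith(("http://", "https://")):
        return absolute
    return ""


page = "https://example.com/articles/post.html"  # illustrative page address
print(resolve_image_url(page, "/static/a.png"))
print(resolve_image_url(page, "//cdn.example.com/b.jpg"))
print(resolve_image_url(page, "img/c.gif"))
print(resolve_image_url(page, "data:image/png;base64,AAAA"))
```

If adopted, getImageUrl would need the page address passed in so that it could call such a helper instead of getRealUrl.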
Usage
- Create a new folder
- Copy img.py into that folder
- Run the command below (xxx is the page URL); the images are downloaded into that folder
python img.py xxx
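The script takes the address from the first command-line argument and falls back to a prompt when none is given. The same behaviour could be sketched with the standard argparse module, which also provides a --help message. This is only an illustration for Python 3 (where input plays the role of raw_input); the helper name parse_url and its ask parameter are hypothetical and not part of img.py.

```python
import argparse


def parse_url(argv, ask=input):
    """Hypothetical helper: take the URL from argv, or ask for it if it is missing."""
    parser = argparse.ArgumentParser(
        prog="img.py",
        description="Download the images found on a web page.",
    )
    # nargs="?" makes the positional argument optional.
    parser.add_argument("url", nargs="?", help="address of the page to scan")
    args = parser.parse_args(argv)
    if args.url:
        return args.url
    # No URL on the command line: fall back to a prompt, as img.py does.
    return ask("please input url:")


print(parse_url(["https://example.com/page.html"]))
print(parse_url([], ask=lambda prompt: "https://example.com/typed.html"))
```

In a real script the call would be parse_url(sys.argv[1:]); the ask parameter exists only so the prompt can be replaced when trying the function out.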
Converting the .py file to an executable
- python-script-converter
https://github.com/ZYunH/Python-script-converter/blob/master/Readme-cn.md
psc img.py 2
chmod +x img.command
- pyinstaller
pyinstaller -F img.py
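img.py saves the images into the directory of sys.argv[0]. For a packaged build, a commonly used alternative is to check the sys.frozen attribute that PyInstaller sets at run time and to take the folder of sys.executable. The sketch below assumes that approach and is not taken from the original script; the helper name app_dir is hypothetical.

```python
import os
import sys


def app_dir():
    """Hypothetical helper: folder of the packaged executable, or of the script."""
    if getattr(sys, "frozen", False):
        # Running from a PyInstaller bundle: sys.executable is the bundled program.
        return os.path.dirname(os.path.realpath(sys.executable))
    # Running as a plain script: same rule as img.py.
    return os.path.dirname(os.path.realpath(sys.argv[0]))


print(app_dir())
```

Using one helper for both cases keeps the download folder predictable whether the program is started from the command line or by double-clicking the executable.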