Results
- The two functions each seem to work fine when tested separately.
- During actual crawling the site blocks requests very aggressively, so I could not fetch many pages or run many tests.
- path_detail = './resut_detail.txt' holds the detailed results scraped from every listing page.
- path_links = './resut_links.txt' holds the URLs of all the listing pages that were collected.
- I am not sure whether with open(path_detail,'a+') as text: is the right way to write this, i.e. whether append mode really guarantees the existing contents of path_detail are never overwritten (see the small check after this list); whether its position inside the loop is correct has not been verified yet.
- with open(path_links,'a+') as text: same question as above.
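A minimal check of the append-mode question (a throwaway sketch, not part of the scraper; demo.txt is just a scratch file name I picked): mode 'w' truncates the file, while 'a'/'a+' keeps what is already there and writes at the end.

import os

with open('demo.txt', 'w') as f:    # 'w' truncates: the file now holds only this line
    f.write('first run\n')
with open('demo.txt', 'a+') as f:   # 'a+' appends: the first line is preserved
    f.write('second run\n')
with open('demo.txt') as f:
    print(f.read())                 # both lines print, so nothing was overwritten
os.remove('demo.txt')               # clean up the scratch file

So repeated runs with 'a+' (or plain 'a') keep adding rows instead of replacing the file; only the position of the with block inside the loop still needs checking.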
Code
from bs4 import BeautifulSoup
import requests  # note the trailing s (requests, not request)
import time
path_detail ='./resut_detail.txt'
path_links ='./resut_links.txt'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'Cookie':'abtest_ABTest4SearchDate=b; OZ_1U_2282=vid=v7f3c69fed80eb.0&ctime=1475593887<ime=0; OZ_1Y_2282=erefer=-&eurl=http%3A//gz.xiaozhu.com/fangzi/2303611027.html&etime=1475593887&ctime=1475593887<ime=0&compid=2282; _ga=GA1.2.1488476801.1475593889; gr_user_id=13bbe192-e386-4074-8ca0-a4a882ba66aa; gr_session_id_59a81cc7d8c04307ba183d331c373ef6=8d7a3db1-e35f-4f23-9ce3-e73afd78b45a; __utma=29082403.1488476801.1475593889.1475594056.1475594056.1; __utmb=29082403.1.10.1475594056; __utmc=29082403; __utmz=29082403.1475594056.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
}
def get_detail(url_detail='http://gz.xiaozhu.com/fangzi/2303611027.html'):
    time.sleep(15)
    web_content = requests.get(url_detail, headers=headers)  # remember to pass headers
    print(web_content)  # print the response object to check the status code
    soup = BeautifulSoup(web_content.text, 'lxml')
    titles = soup.select('div.pho_info h4 em')
    addresses = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p')
    rentals = soup.select('div.day_l')
    images = soup.select('img#curBigImage')  # is selecting by id like this correct?
    landlord_photos = soup.select('div.member_pic > a > img')
    landlord_genders = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
    landlord_names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    for title, address, rental, image, landlord_photo, landlord_gender, landlord_name in zip(
            titles, addresses, rentals, images, landlord_photos, landlord_genders, landlord_names):
        gender_class = landlord_gender.get('class')  # a list such as ['member_ico'] or ['member_ico1'], no str() needed
        if 'member_ico' in gender_class:
            landlord_gender = '男'
        elif 'member_ico1' in gender_class:
            landlord_gender = '女'
        else:
            landlord_gender = '未知'
        data = {
            'title': title.get_text(),
            'address': address.get('title'),
            'rental': rental.get_text(),
            'image': image.get('src'),
            'landlord_photo': landlord_photo.get('src'),
            'landlord_gender': landlord_gender,
            'landlord_name': landlord_name.get_text()
        }
        list_value = list(data.values())
        with open(path_detail, 'a+') as text:  # how to write the fields in a fixed column order? (see the csv sketch after the code) keep appending new results with a+?
            text.write(str(list_value) + '\n')
        print(data)
# get_detail()
url_list = ['http://gz.xiaozhu.com/tianhe-duanzufang-p{}-8/'.format(i) for i in range(1, 2)]  # start small; note range(1, 2) only yields page 1, use range(1, 3) for two pages
def get_moreurls():
    with open(path_links, 'a+') as text:
        for link in url_list:
            time.sleep(2)
            web_content = requests.get(link, headers=headers)  # pass headers here as well
            soup = BeautifulSoup(web_content.text, 'lxml')
            link_lists = soup.select('#page_list ul.pic_list.clearfix li a.resule_img_a')
            for detail_link in link_lists:
                print(detail_link.get('href'))
                text.write(detail_link.get('href') + '\n')  # record the collected links
                get_detail(url_detail=detail_link.get('href'))  # then scrape the detail page itself
get_moreurls()
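On the column-order question in the comment above: one option (just a sketch, not part of the homework code; the result_detail.csv filename and the fieldnames order are my own choices) is csv.DictWriter, which always writes the dict values in the order given by fieldnames.

import csv

fieldnames = ['title', 'address', 'rental', 'image',
              'landlord_photo', 'landlord_gender', 'landlord_name']

def save_row(data, path='./result_detail.csv'):
    # append mode keeps earlier rows, same idea as 'a+' above
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if f.tell() == 0:        # empty file: write the header row once
            writer.writeheader()
        writer.writerow(data)    # values always come out in fieldnames order

Inside get_detail, a call like save_row(data) could then replace the text.write(...) line.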
Summary and Questions
1. So far this still feels less convenient than the 火车头 collector tool, although 火车头 struggles with dynamically loaded data and I have only scratched the surface of it. Maybe Python is better suited to crawling data at a much larger scale and to other automated processing?
For example, what I actually want to collect is every article published within a week by certain media accounts on 今日头条, 一点资讯 and 微博, plus the repost, read and interaction counts on 微博. Those pages all load content dynamically, so Python may be the better fit.
2. The class example so far only "prints" the results without saving them to a txt or csv file, so I should keep experimenting to see what works. I am also not sure what using a dict buys us here; a list would actually be more convenient for our own follow-up operations (sorting, filtering). (For now I am being lazy and just dumping everything as a string, without writing the fields in any particular order; see the csv sketch after the code section.)
3. This homework is not really usable yet; it does not cope well with sites that fight scraping:
- How can a failed fetch be retried automatically? Or move on to the next item in the loop and record the links that failed (try, except, finally?); see the sketch after this list.
- How do you get around the anti-scraping measures? Will time.sleep conflict with multithreaded scraping later? 小猪网 is really brutal about this: even light scraping quickly fails with 404s, so it is not a great site for a continuous-scraping example and is better suited to a later anti-scraping exercise.
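A rough sketch for the retry question above (the failed_links.txt name, 3 retries and the 10-second pause are my own choices; it reuses the headers dict from the code section): wrap the request in try/except, retry a few times, and record the links that still fail so the outer loop can simply move on.

def fetch(url, retries=3, delay=10, failed_log='./failed_links.txt'):
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()    # turn 404/403 responses into exceptions too
            return response
        except requests.RequestException as error:
            print('attempt {} failed for {}: {}'.format(attempt + 1, url, error))
            time.sleep(delay)              # pause before the next attempt
    with open(failed_log, 'a+') as text:   # record the link that kept failing
        text.write(url + '\n')
    return None                            # caller can skip this url and continue

get_detail and get_moreurls would then call fetch(...) instead of requests.get(...) and skip a url when the result is None. As for time.sleep versus multithreading: time.sleep only blocks the thread that calls it, so per-request delays still work in a threaded crawler, although the overall request rate to the site rises with the number of threads.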