Learning to scrape a rental listing site with Python, collecting each listing's rent, address, host nickname, host gender, and photos.
My code:
import time

import bs4
import requests

# Browser-like User-Agent sent with every request
heads = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

# Search result pages 1 through 11
house_list_urls = ["http://sh.xiaozhu.com/search-duanzufang-p{}-0/".format(i) for i in range(1, 12)]


def get_house_info(url):
    """Fetch one listing page and print its details."""
    response = requests.get(url, headers=heads)
    time.sleep(2)  # pause between requests
    soup = bs4.BeautifulSoup(response.text, "lxml")
    title = soup.select('div.pho_info > h4 > em')[0].get_text()
    address = soup.select('div.pho_info > p')[0].get('title')
    price = soup.select('div.day_l > span')[0].get_text()
    avatar = soup.select('div.member_pic > a > img')[0].get('src')
    # get('class') returns a list; the first class name indicates the host's gender
    sex = soup.select('div.member_pic > div')[0].get('class')[0]
    sex = "male" if sex == "member_ico" else "female"
    lord = soup.select("a.lorder_name")[0].get_text()
    print(title, address, price, avatar, sex, lord)


def get_houses(url):
    """Collect the listing links on one search page and scrape each of them."""
    response = requests.get(url, headers=heads)
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    # Each thumbnail <img> sits inside the <a> tag that links to the listing
    house_list = [img.parent.get('href') for img in soup.select('img.lodgeunitpic')]
    for house_url in house_list:
        get_house_info(house_url)


for page_url in house_list_urls:
    get_houses(page_url)
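Summary:
- select() always returns a list, even when only one element matches
- requests.get(url, headers=xxx): note that the keyword is headers, with an "s"
- Calling get("class") on a tag also returns a list
- When extracting listing links from the search results page, you can locate the img element first, then use its parent attribute to reach the enclosing a tag
- bs4.BeautifulSoup(response.text, 'lxml'): don't forget the .text attribute, because BeautifulSoup expects the page markup, not the Response object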
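
The Beautiful Soup points in the summary can be checked offline with a small sketch. The HTML below is made up to imitate the structure the scraper relies on (the example.com URLs and the extra class name are assumptions), and it uses the built-in "html.parser" instead of "lxml" so that no extra parser is needed. select_one() is shown as an alternative that returns a single tag or None rather than a list.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the page structure; not taken from the real site
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/fangzi/1.html"><img class="lodgeunitpic" src="a.jpg"></a></li>
  <li><a href="http://example.com/fangzi/2.html"><img class="lodgeunitpic" src="b.jpg"></a></li>
</ul>
<div class="member_pic"><div class="member_ico extra"></div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list-like result, even for a single match
images = soup.select("img.lodgeunitpic")
single = soup.select("div.member_pic > div")

# class is a multi-valued attribute, so get("class") returns a list of names
classes = single[0].get("class")

# Go from each <img> to its parent <a> to read the listing link
links = [img.parent.get("href") for img in images]

# select_one() returns the first match, or None when nothing matches
missing = soup.select_one("div.does_not_exist")

print(len(images), len(single), classes, links, missing)
```

Indexing with [0] on an empty select() result raises IndexError, so checking the length first (or using select_one() and testing for None) is a safer pattern when a page might lack an element.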
Question:
- Why can't the scraped image link be opened? It is exactly the image link that appears in the page source.
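
The cause is not confirmed here, but two common ones are worth ruling out: the src attribute may hold a placeholder or a relative URL while the real address sits in a lazy-loading attribute, or the image server may reject requests that lack a Referer header. The sketch below is only a way to test those guesses. The attribute names in LAZY_ATTRS, the helper names, and the example.com URLs are all hypothetical and would need to be checked against the actual page source.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical lazy-loading attribute names; check the real page for the actual one
LAZY_ATTRS = ("data-src", "data-original", "lazy_src")


def pick_image_url(img_tag, page_url):
    """Return an absolute image URL, preferring lazy-load attributes over src."""
    for attr in LAZY_ATTRS + ("src",):
        value = img_tag.get(attr)
        if value:
            # urljoin resolves relative and scheme-relative URLs against the page URL
            return urljoin(page_url, value.strip())
    return None


def build_image_headers(page_url, user_agent):
    """Headers for the image request, in case the server checks the Referer."""
    return {"User-Agent": user_agent, "Referer": page_url}


# Made-up markup: src is a placeholder, the real image is in data-src
html = ('<div class="member_pic"><a><img src="/images/placeholder.png" '
        'data-src="//img.example.com/avatar/1.jpg"></a></div>')
page_url = "http://example.com/fangzi/1.html"

img = BeautifulSoup(html, "html.parser").select("div.member_pic > a > img")[0]
url = pick_image_url(img, page_url)
headers = build_image_headers(page_url, "Mozilla/5.0")
print(url)

# To try the download: requests.get(url, headers=headers)
```

If the link opens once it is made absolute or once a Referer is sent, that identifies the cause; if neither helps, comparing the printed URL with the one the browser's developer tools show for the same image is the next thing to try.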