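Exercise:
Write a small concurrent program that looks up word meanings on Youdao (有道翻译), for example the word "china".
URL to scrape: http://dict.youdao.com/w/eng/china/
Output data structure: the word itself, its pronunciation, and its definition. For example:
{'Word': 'china', 'Proc': '', 'Desc': ''}
Hint: you can use plain threads, or a thread pool (ThreadPoolExecutor).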
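Approach:
1. Write a function that downloads http://dict.youdao.com/w/eng/china/
2. Parse the downloaded page. pyquery is a convenient library for this; the parsing takes only a few lines of code.
3. Run the tasks above on multiple threads.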
from pyquery import PyQuery as pq
import requests
import threadpool
def download_html(word):
    # Download the dictionary page for `word` and return the parsed entry.
    output = {'Word': word}
    final_output = {}  # stays empty if the response status is not 200
    url = 'http://dict.youdao.com/w/eng/{}/'.format(word)
    try:
        r = requests.get(url)
        if r.status_code == 200:
            doc = pq(r.text)
            final_output = decode_html(doc, output)
            print(final_output)
    except Exception as e:
        print('Failed to fetch the page for {}: {}'.format(word, e))
        return None
    return final_output
def decode_html(doc, output):
    # Fill in the pronunciation ('Proc') and the definition ('Desc') from the page.
    output['Proc'] = ''
    output['Desc'] = ''
    for pro in doc.items('.baav .pronounce'):
        output['Proc'] += pro.text()
    for li in doc.items('#phrsListTab .trans-container ul li'):
        output['Desc'] += li.text()
    return output
word_list = ['china', 'nice', 'python', 'beautiful', 'girl']
pool = threadpool.ThreadPool(10)  # pool of 10 worker threads
word_pool = threadpool.makeRequests(download_html, word_list)
for req in word_pool:
    pool.putRequest(req)
pool.wait()  # block until every queued request has finished
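
The hint in the exercise mentions ThreadPoolExecutor, while the script above uses the third-party threadpool package. As a rough sketch of the standard-library alternative, the same fan-out could look like the code below. The names `lookup_all` and `fake_fetch` are hypothetical and not part of the original script; `fake_fetch` is only a stand-in for `download_html` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def lookup_all(words, fetch, max_workers=10):
    """Call fetch(word) for every word on a thread pool and collect the results.

    `fetch` is any callable that takes a word and returns an entry (or None).
    Returns a dict mapping each word to its entry; failed lookups map to None.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which word so errors can be reported by name.
        future_to_word = {executor.submit(fetch, w): w for w in words}
        for future in as_completed(future_to_word):
            word = future_to_word[future]
            try:
                results[word] = future.result()
            except Exception as exc:
                print('Lookup failed for {}: {}'.format(word, exc))
                results[word] = None
    return results


def fake_fetch(word):
    # Hypothetical stand-in for download_html: returns an empty entry, no network needed.
    return {'Word': word, 'Proc': '', 'Desc': ''}


if __name__ == '__main__':
    word_list = ['china', 'nice', 'python', 'beautiful', 'girl']
    for word, entry in lookup_all(word_list, fake_fetch).items():
        print(word, entry)
```

To use it against the real site, pass `download_html` in place of `fake_fetch`. Unlike the version above, which only prints from inside the worker threads, this variant hands the collected entries back to the caller.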