Scrape the "gene - tumor - targeted drug" information from the OncoKB database by driving a browser.
1. Install the Chrome browser, download and configure ChromeDriver (its version should match the installed Chrome), and add it to the PATH environment variable.
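To confirm that ChromeDriver is found on the PATH, Selenium should be able to start Chrome directly. A minimal check, assuming Selenium 4 and a reasonably recent Chrome (the --headless=new flag and the printed version string are only for this sanity test):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")       # no visible window needed for the check
browser = webdriver.Chrome(options=options)  # fails here if ChromeDriver cannot be found
print(browser.capabilities.get("browserVersion"))
browser.quit()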
The boxed areas in the figure below mark the information to be extracted for a given gene. Because this page is rendered dynamically by JavaScript, it is not suitable for direct scraping, so the information is fetched by driving a real browser.
2. Prepare the gene list file gene_list.txt with the genes to be crawled.
One GeneSymbol per line; the content looks like the following:
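For example (the four symbols below are only an illustration; any OncoKB gene symbols can be listed):

BRAF
EGFR
KRAS
ALK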
3. Drive the Chrome browser with Selenium's webdriver, parse the rendered page with BeautifulSoup, and extract the needed information.
The code is as follows:
import time, random, os
from queue import Queue
from threading import Thread
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
def download_and_extract(gene_name):
    """
    Download the OncoKB page of gene_name, then extract the needed fields into a txt file.
    """
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    url = 'https://www.oncokb.org/gene/' + gene_name
    browser = webdriver.Chrome(options=options)
    browser.get(url)
    try:
        # Wait (up to 20 s) for the JavaScript-rendered table to appear
        WebDriverWait(browser, 20, poll_frequency=1).until(
            EC.presence_of_element_located((By.XPATH, '//*[@class="-loading-inner"]')))
        # Save the rendered page source
        html_file = gene_name + ".html"
        with open(html_file, 'w', encoding='utf-8') as fr:
            fr.write(browser.page_source)
        # Parse the saved page
        with open(html_file, encoding='utf-8') as fh:
            soup = BeautifulSoup(fh, 'html.parser')
        one = []    # cells of the table row being assembled
        rows = []   # all completed rows
        for a in soup.find_all(name="div", attrs="rt-td"):
            if not a.text:
                # Empty cell: the evidence level is encoded in the <i> icon's class name
                one.append(a.i.attrs['class'][2][6:])
            elif re.match(r'\d+$', a.text):
                # A purely numeric cell is the last column of a row
                one.append(a.text)
                rows.append(one)
                one = []
            elif len(a.contents) > 1:
                # Multi-part cell: join the parts, replacing <br/> with "; "
                one.append(''.join([re.sub(r'<br/>', r'; ', str(i)) for i in a.contents]))
            else:
                one.append(a.text)
        # Write the extracted rows to a tab-separated txt file
        get_txt = gene_name + ".txt"
        with open(get_txt, 'w') as fw:
            for row in rows:
                fw.write("\t".join([str(i) for i in row]) + "\n")
    except Exception:
        print('Error occurs when downloading %s' % gene_name)
    finally:
        # Always close the browser, even if the download or parsing failed
        browser.quit()
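# Optional: a quick single-gene test of the function above, e.g.
#   download_and_extract("BRAF")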
def get_html_one_queue(in_q):
    """
    Worker thread: take gene names from the queue until it is empty.
    """
    # Stagger the worker start-up so the requests are not fired all at once
    time.sleep(random.randint(5, 10))
    while not in_q.empty():
        gene_name = in_q.get()
        download_and_extract(gene_name)
        in_q.task_done()
def main():
    queue = Queue()
    # Read the gene list, skipping empty lines
    with open('gene_list.txt') as f_in:
        gene_list = f_in.read().split('\n')
    for gene_name in gene_list:
        if gene_name:
            queue.put(gene_name)
    print('Queue start: %d tasks.' % queue.qsize())
    # All output goes into a "download" folder
    if not os.path.exists("download"):
        os.mkdir("download")
    os.chdir("download")
    # Run 5 worker threads concurrently
    for index in range(5):
        thread = Thread(target=get_html_one_queue, args=(queue,))
        thread.start()
    queue.join()
    print('Queue end.')

if __name__ == "__main__":
    main()
4. Run the code above in the same directory as gene_list.txt. It produces a download folder containing, for each gene, the saved HTML page and a text file with the extracted information.
Each extracted text file holds one table row per line, with the fields separated by tabs.
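To work with all genes at once, the per-gene files can be merged into a single table. A minimal sketch, assuming the script above has finished and the download folder holds the per-gene .txt files (the output file name oncokb_combined.txt is arbitrary):

import os

with open('gene_list.txt') as f_in:
    genes = [g for g in f_in.read().split('\n') if g]

with open('oncokb_combined.txt', 'w') as out:
    for gene in genes:
        txt = os.path.join('download', gene + '.txt')
        if not os.path.exists(txt):   # a gene may have failed to download
            continue
        with open(txt) as fh:
            for line in fh:
                # Prepend the gene symbol as the first column
                out.write(gene + '\t' + line)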