I wrote this a long time ago while learning web scraping in R, and I'm moving it here.
Encoding conversion
iconv(text, from = "UTF-8")  # convert text from UTF-8 to the native encoding
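As a small sketch of what the conversion does: `iconv()` takes the source encoding in `from` and the target encoding in `to`. The Latin-1 sample bytes below are made up for illustration. For a Chinese page the `from` value would more likely be something like "GBK", assuming the platform's iconv supports it (`iconvlist()` shows what is available).

```r
# Hypothetical input: the bytes of "café" encoded as Latin-1 (0xE9 is "é")
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Those bytes are not valid UTF-8 as they stand
validUTF8(x)

# Re-encode from Latin-1 to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

# The single byte 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9
charToRaw(y)
validUTF8(y)
```

If the text still looks garbled after conversion, the `from` value is usually the thing to check first.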
Method 1: using RCurl
Background needed: regular expressions / XML
install.packages("RCurl")
install.packages("XML")
library(RCurl)
library(XML)
myHttpheader <- c(
  "User-Agent" = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" = "en-us",
  "Connection" = "keep-alive",
  "Accept-Charset" = "GB2312,utf-8;q=0.7,*;q=0.7")
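url <- "https://book.douban.com/top250?icn=index-book250-all"
# Download the raw HTML, sending the custom request headers
webpage <- getURL(url, httpheader = myHttpheader, .encoding = "UTF-8")
# Parse into an internal document so XPath queries can be run against it
pagetree <- htmlTreeParse(webpage, encoding = "UTF-8", error = function(...) {},
                          useInternalNodes = TRUE, trim = TRUE)
# Select the text nodes of every <p class="pl"> element
node <- getNodeSet(pagetree, "//p[@class='pl']/text()")
info <- sapply(node, xmlValue)
info
node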
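To try the XPath step without hitting the network, here is a rough sketch that runs the same kind of query against an inline HTML fragment. The fragment and the `info_demo` name are invented for illustration; only `htmlParse()`, `xpathSApply()` and `xmlValue()` come from the XML package.

```r
library(XML)

# Hypothetical HTML fragment standing in for the downloaded page
html <- paste0(
  "<html><body>",
  "<p class='pl'>Author A / Publisher A / 2006-5</p>",
  "<p class='other'>not selected</p>",
  "<p class='pl'>Author B / Publisher B / 2012-1</p>",
  "</body></html>"
)

# asText = TRUE tells the parser the argument is HTML content, not a file name
doc <- htmlParse(html, asText = TRUE)

# Run the XPath query and apply xmlValue() to each matching node
info_demo <- xpathSApply(doc, "//p[@class='pl']", xmlValue)
print(info_demo)
```

Only the two paragraphs with `class='pl'` are picked up. The paragraph with a different class is skipped.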
Method 2: using rvest
Background needed: CSS / XPath
install.packages("rvest")
library(rvest)
web <- read_html("https://book.douban.com/top250?icn=index-book250-all", encoding = "UTF-8")
position <- web %>% html_nodes("p.pl") %>% html_text()
position
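Scraped text usually needs some cleanup before it is useful, and this is where regular expressions come in. Below is a minimal base R sketch. The sample strings, the " / " separator and the field order are assumptions made for illustration, and the real page's text may be laid out differently.

```r
# Hypothetical scraped strings with stray whitespace around them
raw <- c("\n  Author A / Publisher A / 2006-5 / 29.00\n",
         "  Author B / Publisher B / 2012-1 / 39.50  ")

# Strip leading/trailing whitespace, then collapse internal runs of whitespace
clean <- gsub("\\s+", " ", trimws(raw))

# Split each record on the literal " / " separator
fields <- strsplit(clean, " / ", fixed = TRUE)

# Take the last field of each record and convert it to a number
price <- as.numeric(sapply(fields, function(f) f[length(f)]))

# Pull out a year-month token such as "2006-5" with a regular expression
published <- regmatches(clean, regexpr("[0-9]{4}-[0-9]{1,2}", clean))

print(clean)
print(price)
print(published)
```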
Book ratings
Select all the ratings
position2 <- web %>% html_nodes("span.pl") %>% html_text()
# Alternative selector; this overwrites the result above
position2 <- web %>% html_nodes("div span.pl") %>% html_text()
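The two selectors are not always equivalent: `span.pl` matches every `span` with class `pl`, while `div span.pl` only matches those nested somewhere inside a `div`. Here is a small sketch on a made-up fragment (the HTML and the variable names are hypothetical). On the real page the two may well return the same nodes.

```r
library(rvest)

# Hypothetical fragment: one span.pl inside a div, one outside any div
page <- read_html(paste0(
  "<html><body>",
  "<div><span class='pl'>inside div</span></div>",
  "<p><span class='pl'>inside p</span></p>",
  "</body></html>"
))

# Class selector: every <span class="pl">
all_spans <- page %>% html_nodes("span.pl") %>% html_text()

# Descendant selector: only <span class="pl"> that has a <div> ancestor
div_spans <- page %>% html_nodes("div span.pl") %>% html_text()

print(all_spans)
print(div_spans)
```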
Select all the short descriptions (two ways to write it)
position3 <- web %>% html_nodes("p.quote") %>% html_text()
position3 <- web %>% html_nodes("span.inq") %>% html_text()
Select all the book titles
position4 <- web %>% html_nodes("a[title]") %>% html_text()
position4
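Since `a[title]` selects links that carry a `title` attribute, another option is to read that attribute directly with `html_attr()` instead of taking the link text. This is a sketch on an invented fragment. The HTML, the `books` data frame and its column names are assumptions, not something taken from the real page.

```r
library(rvest)

# Hypothetical fragment: the link text is padded, the title attribute is clean
page <- read_html(paste0(
  "<html><body>",
  "<a href='/book/1' title='Book One'>  Book One  </a>",
  "<a href='/book/2' title='Book Two'>  Book Two  </a>",
  "<a href='/about'>About</a>",
  "</body></html>"
))

# Only links that have a title attribute are selected
links <- page %>% html_nodes("a[title]")

titles_text <- links %>% html_text(trim = TRUE)  # visible link text, trimmed
titles_attr <- links %>% html_attr("title")      # value of the title attribute
hrefs <- links %>% html_attr("href")             # link target

# Collect the pieces into one table
books <- data.frame(title = titles_attr, link = hrefs, stringsAsFactors = FALSE)
print(books)
```

The attribute route avoids having to trim whitespace out of the link text, at the cost of relying on the page filling in `title` consistently.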