Comatose web crawler in R (w/ rvest)
Question
I recently discovered the rvest package in R and decided to try out some web scraping.
I wrote a small web crawler as a function so I could pipe its output onwards to clean it up, etc.
With a small URL list (e.g. 1-100) the function works fine, but with a larger list it hangs at some point. It seems one of the commands is waiting for a response but never gets one, and no error is raised.
urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)
  assets <- NA
  price <- NA
  description <- NA
  city <- NA
  n <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)
  for (i in 1:n) {
    # scraping for price #
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
    # scraping for city #
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)
    # scraping for description #
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)
    # scraping for assets #
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)
    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }
  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")
  return(as.data.frame(cbind(price, city, description, assets)))
}
(1) Without knowing the exact problem, I looked for a timeout option in the rvest package, to no avail. I then tried the timeout option from the httr package (the console still hung as a result). For ".price" the call would become:
try( {content(GET(url_list[i], timeout=(10)), timeout=(10), as="text") %>% read_html() %>% html_node(".price span") %>% html_text()->price[i]}, silent=TRUE)
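For reference, httr expects the timeout as a config object built by timeout() rather than as a named argument, and content() has no timeout argument at all. A minimal sketch of that pattern, reusing the variables from the function above:

library(httr)
library(rvest)

# timeout(10) is an httr config object: GET() gives up if the server
# has not responded within 10 seconds instead of blocking indefinitely
resp <- GET(url_list[i], timeout(10))
content(resp, as = "text") %>%
  read_html() %>%
  html_node(".price span") %>%
  html_text() -> price[i]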
I thought of other solutions and tried to implement them, but they did not work either.
(2) A time limit with setTimeLimit():
n <- length(url_list)
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed = 20)
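A likely reason this has no effect: setTimeLimit() can only interrupt R at points where it checks for interrupts, and a socket read blocked inside C code never reaches such a point. A sketch of the per-iteration variant using the transient argument (my assumption about how the limit was meant to be applied; url_list, n, pb and price as above):

library(rvest)  # for read_html() and %>%

for (i in 1:n) {
  # transient = TRUE re-arms the 20-second limit for the next top-level computation only
  setTimeLimit(elapsed = 20, transient = TRUE)
  try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
  setTxtProgressBar(pb, i)
}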
(3) A test for URL success, with a counter c that increases after the 4th scrape:
for (i in 1:n) {
  while (url_success(url_list[i]) == TRUE & c == i) {
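For context, url_success() comes from httr (it has since been deprecated in favour of checking status_code() yourself), and it issues its own HTTP request, so the guard can hang exactly like the scrape it protects. A hypothetical completion of the idea, with c advancing only after the scrapes for URL i have finished:

library(httr)   # url_success()
library(rvest)  # read_html() and %>%

c <- 1
for (i in 1:n) {
  # the success check performs its own request and can therefore block too
  while (url_success(url_list[i]) == TRUE && c == i) {
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
    # ...city, description and assets scraped as in the function above...
    c <- c + 1  # move on only once URL i has been processed
  }
}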
None of this worked, so the function still hangs when the URL list is large. Question: why does the console hang, and how can it be solved? Thanks for reading.
Answer
Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script, no matter whether they are fetched with read_html(..) from rvest, GET(..) from httr, or getURL(..) / getURLContent(..) from RCurl.
The only solution that worked for me is a combination of evalWithTimeout from R.utils and a tryCatch block:
# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE
url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"
page = tryCatch(
evalWithTimeout({ read_html(url, encoding="UTF-8") }, timeout = 5),
error = function(e) {
pageIsBroken <<- TRUE;
return(e)
}
)
if (pageIsBroken) {
print(paste("Error Msg:", toString(page)))
}
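As a usage note, evalWithTimeout() has since been deprecated in R.utils in favour of withTimeout(), which takes the same arguments. A hypothetical way to fold this guard into the question's loop, fetching each page once and reusing the parsed document for all four fields (url_list, n, pb and the result vectors assumed from the question):

library(R.utils)
library(rvest)

for (i in 1:n) {
  page <- tryCatch(
    withTimeout(read_html(url_list[i]), timeout = 5),  # hard 5-second cap per fetch
    error = function(e) NULL                           # timeouts and broken pages yield NULL
  )
  if (!is.null(page)) {
    # one request per URL instead of four; extract every field from the same parse
    price[i]       <- page %>% html_node(".price span") %>% html_text()
    city[i]        <- page %>% html_node(".city") %>% html_text()
    description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
    assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")
  }
  setTxtProgressBar(pb, i)
}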