Comatose web crawler in R (w/ rvest)


Problem Description

I recently discovered the rvest package in R and decided to try out some web scraping.

I wrote a small web crawler as a function so that I could pipe its output onward for cleaning, etc.

With a small URL list (e.g. 1-100) the function works fine, but with a larger list it hangs at some point. It seems that one of the commands is waiting for a response that never arrives, and it does not raise an error either.

urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)

  assets <- NA
  price <- NA
  description <- NA
  city <- NA

  n <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)

  for (i in 1:n) {
    # scraping for price
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)

    # scraping for city
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)

    # scraping for description
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)

    # scraping for assets
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)

    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }

  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")

  return(as.data.frame(cbind(price, city, description, assets)))
}
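As an aside, each iteration above downloads and parses the same URL four times, once per field. A minimal sketch of a loop body that fetches each page once and reuses the parsed document (same selectors and variable names as the question; this is an illustration, not part of the original code), which also reduces the number of network calls that can block:

for (i in 1:n) {
  try({
    page <- read_html(url_list[i])  # fetch and parse once per URL
    price[i]       <- page %>% html_node(".price span") %>% html_text()
    city[i]        <- page %>% html_node(".city") %>% html_text()
    description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
    assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")
  }, silent = TRUE)
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}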

(1) Without knowing the exact problem, I looked for a timeout option in the rvest package, to no avail. I then tried the timeout option from the httr package (with the console still hanging as a result). For ".price" it would become:

try( {content(GET(url_list[i], timeout=(10)), timeout=(10), as="text") %>% read_html() %>% html_node(".price span") %>% html_text()->price[i]}, silent=TRUE)
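For reference, httr normally takes the timeout as a request option created by httr::timeout() rather than as a timeout= argument to GET() or content(); a minimal sketch of that usage, reusing the selector and target vector from the loop above:

library(httr)
library(rvest)

resp <- GET(url_list[i], timeout(10))  # give up on the request after 10 seconds
content(resp, as = "text") %>%
  read_html() %>%
  html_node(".price span") %>%
  html_text() -> price[i]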

I thought of other solutions and tried to implement them, but they did not work.

(2) A time limit with setTimeLimit:

n <- length(url_list)
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed = 20)
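Called once like this, setTimeLimit() caps the remaining top-level computation as a whole rather than each request, and the limit is only checked at interruptible points, so a download blocked inside a low-level socket call may never hit it. A sketch of a per-iteration variant (an assumption about the intended use, not a guaranteed fix):

for (i in 1:n) {
  setTimeLimit(elapsed = 20, transient = TRUE)  # restart the 20-second budget for this iteration
  try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
  setTimeLimit(elapsed = Inf)                   # clear the limit again
  setTxtProgressBar(pb, i)
}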

(3) Testing for URL success, with c increasing after the 4th scrape:

for (i in 1:n) {
        while(url_success(url_list[i])==TRUE & c==i) {
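A sketch of what such a pre-check could look like; httr's url_success() has since been deprecated in favour of http_error(), so this version uses the latter and simply skips URLs that fail or time out (an illustration of the idea, not the asker's exact code):

library(httr)

for (i in 1:n) {
  ok <- tryCatch(!http_error(GET(url_list[i], timeout(10))),
                 error = function(e) FALSE)  # treat connection errors and timeouts as failure
  if (!ok) next                              # skip unreachable URLs
  # ... the four scraping calls as before ...
  setTxtProgressBar(pb, i)
}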

None of it worked, so the function still hangs when the URL list is large. Question: why does the console hang, and how can it be solved? Thanks for reading.

Recommended Answer

Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script, no matter whether the page is fetched with read_html(..) from rvest, GET(..) from httr, or getURL(..) and getURLContent(..) from RCurl.

The only solution that worked for me is a combination of evalWithTimeout from R.utils and a tryCatch block:

# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE

url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"

page = tryCatch(
  # give up on the page if read_html takes longer than 5 seconds
  evalWithTimeout({ read_html(url, encoding = "UTF-8") }, timeout = 5),
  error = function(e) {
    pageIsBroken <<- TRUE
    return(e)
  }
)

if (pageIsBroken) {
  print(paste("Error Msg:", toString(page)))
}
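A rough sketch of how this timeout-plus-tryCatch pattern could be folded back into the original crawler loop; it uses withTimeout(), which has superseded evalWithTimeout() in current R.utils versions, and everything beyond the question's selectors and variable names is an assumption:

library(R.utils)
library(rvest)

for (i in 1:n) {
  page <- tryCatch(
    withTimeout(read_html(url_list[i]), timeout = 5, onTimeout = "error"),
    error = function(e) NULL  # timeout or HTTP failure -> record nothing for this URL
  )
  if (is.null(page)) { setTxtProgressBar(pb, i); next }

  price[i]       <- page %>% html_node(".price span") %>% html_text()
  city[i]        <- page %>% html_node(".city") %>% html_text()
  description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
  assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")

  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}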
