Comatose web crawler in R (w/ rvest)


Problem Description

I recently discovered the rvest package in R and decided to try out some web scraping.

I wrote a small web crawler as a function so that I could pipe its output onward for cleaning, etc.

With a small URL list (e.g. 1-100) the function works fine, but with a larger list it hangs at some point. It seems that one of the commands is waiting for a response that never arrives, and it does not raise an error either.

urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)

  assets <- NA
  price <- NA
  description <- NA
  city <- NA

  n <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)

  for (i in 1:n) {
    # scraping for price
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)

    # scraping for city
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)

    # scraping for description
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)

    # scraping for assets
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)

    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }

  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")

  return(as.data.frame(cbind(price, city, description, assets)))
}
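As an aside, each iteration above downloads and parses the same URL four times, once per field. A minimal sketch of a loop body that fetches each page once and reuses the parsed document (same selectors and variable names as the question; this is an illustration, not part of the original code), which also reduces the number of network calls that can block:

for (i in 1:n) {
  try({
    page <- read_html(url_list[i])  # fetch and parse once per URL
    price[i]       <- page %>% html_node(".price span") %>% html_text()
    city[i]        <- page %>% html_node(".city") %>% html_text()
    description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
    assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")
  }, silent = TRUE)
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}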

(1) Without knowing the exact problem, I looked for a timeout option in the rvest package, to no avail. I then tried the timeout option from the httr package (with the console still hanging as a result). For ".price" it would become:

try( {content(GET(url_list[i], timeout=(10)), timeout=(10), as="text") %>% read_html() %>% html_node(".price span") %>% html_text()->price[i]}, silent=TRUE)
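For reference, httr normally takes the timeout as a request option created by httr::timeout() rather than as a timeout= argument to GET() or content(); a minimal sketch of that usage, reusing the selector and target vector from the loop above:

library(httr)
library(rvest)

resp <- GET(url_list[i], timeout(10))  # give up on the request after 10 seconds
content(resp, as = "text") %>%
  read_html() %>%
  html_node(".price span") %>%
  html_text() -> price[i]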

I thought of other solutions and tried to implement them, but they did not work.

(2) A time limit with setTimeLimit:

n <- length(url_list)
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed = 20)
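Called once like this, setTimeLimit() caps the remaining top-level computation as a whole rather than each request, and the limit is only checked at interruptible points, so a download blocked inside a low-level socket call may never hit it. A sketch of a per-iteration variant (an assumption about the intended use, not a guaranteed fix):

for (i in 1:n) {
  setTimeLimit(elapsed = 20, transient = TRUE)  # restart the 20-second budget for this iteration
  try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
  setTimeLimit(elapsed = Inf)                   # clear the limit again
  setTxtProgressBar(pb, i)
}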

(3) Testing for URL success, with c increasing after the 4th scrape:

for (i in 1:n) {
        while(url_success(url_list[i])==TRUE & c==i) {
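A sketch of what such a pre-check could look like; httr's url_success() has since been deprecated in favour of http_error(), so this version uses the latter and simply skips URLs that fail or time out (an illustration of the idea, not the asker's exact code):

library(httr)

for (i in 1:n) {
  ok <- tryCatch(!http_error(GET(url_list[i], timeout(10))),
                 error = function(e) FALSE)  # treat connection errors and timeouts as failure
  if (!ok) next                              # skip unreachable URLs
  # ... the four scraping calls as before ...
  setTxtProgressBar(pb, i)
}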

None of it worked, so the function still hangs when the URL list is large. Question: why does the console hang, and how can it be solved? Thanks for reading.

Recommended Answer

Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script, no matter whether the page is fetched with read_html(..) from rvest, GET(..) from httr, or getURL(..) and getURLContent(..) from RCurl.

The only solution that worked for me is a combination of evalWithTimeout from R.utils and a tryCatch block:

# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE

url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"

page = tryCatch(
  # give up on the page if read_html takes longer than 5 seconds
  evalWithTimeout({ read_html(url, encoding = "UTF-8") }, timeout = 5),
  error = function(e) {
    pageIsBroken <<- TRUE
    return(e)
  }
)

if (pageIsBroken) {
  print(paste("Error Msg:", toString(page)))
}
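A rough sketch of how this timeout-plus-tryCatch pattern could be folded back into the original crawler loop; it uses withTimeout(), which has superseded evalWithTimeout() in current R.utils versions, and everything beyond the question's selectors and variable names is an assumption:

library(R.utils)
library(rvest)

for (i in 1:n) {
  page <- tryCatch(
    withTimeout(read_html(url_list[i]), timeout = 5, onTimeout = "error"),
    error = function(e) NULL  # timeout or HTTP failure -> record nothing for this URL
  )
  if (is.null(page)) { setTxtProgressBar(pb, i); next }

  price[i]       <- page %>% html_node(".price span") %>% html_text()
  city[i]        <- page %>% html_node(".city") %>% html_text()
  description[i] <- page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ")
  assets[i]      <- page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ")

  Sys.sleep(2)
  setTxtProgressBar(pb, i)
}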
