Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"


Problem Description

I'm scraping this website using the "rvest" package. When I iterate my function too many times I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I have a suspicion that it is server side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?
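One way to probe the rate-limiting hypothesis is to inspect the raw HTTP responses instead of going through read_html(). Below is a minimal sketch using the httr package (an assumption; the original question only uses rvest): request the page repeatedly and watch for a non-200 status code, such as 429 Too Many Requests, or a Retry-After header.

library(httr)

link <- "http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2"

# Hit the page repeatedly and report the HTTP status of each response;
# a 429 (or another non-200 code) after many requests would support
# the rate-limiting hypothesis
for (i in 1:100) {
  resp <- GET(link, timeout(10))
  cat(i, "status:", status_code(resp), "\n")
  if (status_code(resp) != 200) {
    # Some servers announce how long to back off via the Retry-After header
    print(headers(resp)[["retry-after"]])
    break
  }
}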

The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated web pages. I have simplified my scraping function somewhat, as the problem still occurs with a simpler function:

library(rvest)     # read_html(), html_nodes(), html_text(), %>%
library(stringr)   # str_split(), str_replace_all(), str_trim()

scrape_test <- function(link) {

  # The 5th and 6th segments of the URL path hold the course id and semester
  slit <- str_split(link, "/") %>%
    unlist()
  id <- slit[5]
  sem <- slit[6]

  # Download the page and pull the course name from its <h2> heading
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "") %>%
    str_trim()

  return(data.frame(id, sem, name))
}

I use map_df() from the purrr package to iterate the function:

test.data <- links %>%
  map_df(scrape_test)

Now, if I iterate the function using only 50 links, I receive no error. But when I increase the number of links, I encounter the aforementioned error. Furthermore, I get the following warnings:

  • "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"
  • "closing unused connection 4 (link)"

Edit: The following code, which makes an object of the links, can be used to reproduce my results:

links <- c(rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 100))
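If the server is indeed throttling repeated requests, simply spacing them out may avoid the timeout altogether. A minimal sketch, assuming purrr >= 0.3.0, which provides slowly() and rate_delay() (neither appears in the original question):

library(purrr)

# Wrap the scraper so each call waits two seconds before running
slow_scrape <- slowly(scrape_test, rate = rate_delay(2))

test.data <- links %>%
  map_df(slow_scrape)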

Recommended Answer

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

Here I do a for-loop with a tryCatch block, so that if the output is an error we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we are still getting an error after five attempts. In addition, we have if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we have already scraped when we restart it.

for (i in seq_along(links)) {
  # Skip links that were already scraped before a restart of the loop
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    # Try each link up to five times before moving on
    while (!ok && counter < 5) {
      counter <- counter + 1
      out <- tryCatch({
                  scrape_test(links[i])
                },
                error = function(e) {
                  Sys.sleep(2)  # pause before the next attempt
                  e             # return the error object itself
                }
              )
      if ("error" %in% class(out)) {
        cat(".")  # one dot per failed attempt
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out            # store the result (or the last error)
    names(d)[i] <- links[i]  # name the entry after its link
  }
}
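Once the loop finishes, the successful results can be combined into a single data frame. A minimal sketch, assuming dplyr and purrr are loaded; entries that are still error objects (links that failed all five attempts) are dropped first:

library(purrr)
library(dplyr)

# Keep only the entries that were scraped successfully, then stack them
test.data <- d %>%
  keep(is.data.frame) %>%
  bind_rows()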
