How to set timeout in rvest


Problem description


Simple question: this code x <- read_html(url) hangs, reading the page indefinitely. I don't know how to handle this, for example by setting a maximum response time. I could use try, catch, or whatever to retry, but it just hangs and nothing happens. Does anyone know how to deal with it?

There's no problem with the page itself. The hang only occurs sometimes, and when I retry manually it works.

Solution

You can fetch the page with the GET function from the httr package, passing it a timeout, and then pipe the response into read_html.

For example, if your original code was

library(rvest)
library(dplyr)

my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)

then you could replace it with

library(httr)

# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html

# Allow 30 seconds
my_url %>% GET(., timeout(30)) %>% read_html
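
If the same call is needed in several places, it may be convenient to wrap it in a small helper. The following is only a sketch: read_html_timeout is a hypothetical name (not part of rvest or httr), and the stop_for_status() call is an added assumption that turns HTTP error statuses such as 404 into R errors rather than parsing the error page.

```r
# Hypothetical helper (not part of rvest or httr): fetch a page with a
# time limit and parse it. Functions are namespace-qualified, so no
# library() calls are needed.
read_html_timeout <- function(url, seconds = 10) {
  # GET() raises an R error if the request takes longer than `seconds`
  resp <- httr::GET(url, httr::timeout(seconds))
  # Added assumption: treat HTTP error statuses (4xx/5xx) as R errors too
  httr::stop_for_status(resp)
  # read_html() accepts the httr response object directly
  rvest::read_html(resp)
}

# Usage (needs network access):
# x <- read_html_timeout(my_url, seconds = 10)
```

Keeping the timeout as an argument with a default makes it easy to allow slow pages more time without touching the rest of the code.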

Example

To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second)

# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely

my_url %>% GET(., timeout(0.01)) %>% read_html

# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   Timeout was reached: Resolving timed out after 10 milliseconds

You can find some more examples here
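
Since a timeout surfaces as an ordinary R error, it can be useful to tell it apart from other failures (for example a mistyped host name). The sketch below does this by inspecting the error message. is_timeout_error is a hypothetical helper, and the pattern it matches is an assumption based on the message shown above; the exact wording may differ between curl versions.

```r
# Hypothetical helper: guess whether a caught error was a timeout by
# inspecting its message. The pattern is an assumption based on the
# message shown above; the wording may differ between curl versions.
is_timeout_error <- function(e) {
  grepl("timeout was reached|timed out", conditionMessage(e), ignore.case = TRUE)
}

# Offline demonstration with hand-made error objects
timeout_err <- simpleError("Timeout was reached: Resolving timed out after 10 milliseconds")
other_err   <- simpleError("Could not resolve host: example.invalid")

is_timeout_error(timeout_err)  # TRUE
is_timeout_error(other_err)    # FALSE

# Possible use inside tryCatch() (needs network access):
# result <- tryCatch(
#   rvest::read_html(httr::GET(my_url, httr::timeout(10))),
#   error = function(e) if (is_timeout_error(e)) "Timed out!" else stop(e)
# )
```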

Using it in a loop (e.g. 'skip to the next if timed out')

Try running this code. It assumes you have a number of URLs to visit (3 in this case). The second URL below delays 3 seconds before returning the HTML, which is a great way to test the functionality you're looking for. We set the timeout to 2 seconds, so we know it will fail. If the first expression passed to tryCatch() raises an error, the function supplied as the error argument runs instead; in this case it simply assigns 'Timed out!' to the list element


my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
             "http://httpbin.org/delay/3", # This url will delay 3 seconds
             "http://httpbin.org/delay/1") 

x <- list()

# Set timeout for 2 seconds (so second url will fail)
for (i in seq_along(my_urls)) {

  print(paste0("Scraping url number ", i))

  # tryCatch() returns the handler's value on error, so assign its result directly
  x[[i]] <- tryCatch(my_urls[i] %>% GET(., timeout(2)) %>% read_html,
                     error = function(e) "Timed out!")

}

Now we inspect the output - the first and third sites returned content, the second timed out

# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
# [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n    <div id="notify-container"></div>\r\n    <div id="custom ...
# 
# [[2]]
# [1] "Timed out!"
# 
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": ...


You can set the timeout to whatever value you want; 30 to 60 seconds could be sensible, depending on the use case.
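
Because the question mentions that a manual retry usually works, the timeout can also be combined with an automatic retry. The following is a minimal sketch under stated assumptions: fetch_with_retry, its attempts and pause arguments, and the flaky_fetch stand-in are all hypothetical names introduced for illustration, and the 10 second timeout in the default fetcher is an arbitrary choice. The fetcher is passed in as an argument so the retry logic can be tried without network access. httr also provides a RETRY() function that may cover this case.

```r
# Hypothetical helper: try `fetch(url)` up to `attempts` times, pausing
# between failures. `fetch` is injectable so the retry logic can be
# exercised without network access.
fetch_with_retry <- function(url, attempts = 3, pause = 1,
                             fetch = function(u) {
                               rvest::read_html(httr::GET(u, httr::timeout(10)))
                             }) {
  for (attempt in seq_len(attempts)) {
    # On error, keep the condition object instead of aborting
    result <- tryCatch(fetch(url), error = function(e) e)
    # Anything that is not an error object counts as success
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < attempts) Sys.sleep(pause)
  }
  stop("All ", attempts, " attempts failed for ", url)
}

# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls <- 0
flaky_fetch <- function(u) {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  paste("content of", u)
}

res <- fetch_with_retry("http://example.invalid", pause = 0, fetch = flaky_fetch)
```

With the default fetcher, a call such as fetch_with_retry(my_url) would make up to three requests, each limited to 10 seconds, and raise an error only if all of them fail.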
