Scraping with rvest: Getting error HTTP 502

Question

I have an R script that uses rvest to pull some data from accuweather. The accuweather URLs contain IDs that uniquely correspond to cities. I'm trying to pull the IDs in a given range together with the associated city names. rvest itself works perfectly for a single ID, but when I iterate through a for loop it eventually returns this error - "Error in open.connection(x, "rb") : HTTP error 502."

I suspect this error is due to the website blocking me. How do I get around this? I want to scrape quite a large range (10,000 IDs), and it keeps giving me this error after ~500 iterations of the loop. I also tried closeAllConnections() and Sys.sleep(), but to no avail. I'd really appreciate any help with this problem.

Solved. I found a way around it through this thread: "Use tryCatch skip to next value of loop upon error?". I used tryCatch() with error = function(e) e as an argument; it suppressed the error message and allowed the loop to continue without breaking. Hopefully, this will be helpful to anyone else stuck on a similar problem.

library(rvest)
library(httr)

# create matrix to store IDs and Cities
# each ID corresponds to a single city
id_mat <- matrix(0, ncol = 2, nrow = 10001)

# initialize index for matrix row
j <- 1

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values
  id_mat[j, 1] <- i
  id_mat[j, 2] <- citystate
  # move to the next matrix row (the for loop advances i by itself)
  j <- j + 1
  # close connections after every 200 pulls, wait 5 mins and loop again
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
  } else {
    # sleep for 1 or 2 seconds every loop
    Sys.sleep(sample(2, 1))
  }
}
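
The tryCatch() workaround mentioned in the edit to the question can be sketched roughly as follows. This is only an illustration under stated assumptions: scrape_ids() and fake_fetch() are hypothetical names that do not appear in the original post, and the fetching step is passed in as a function so the sketch runs without network access.

```r
# Hypothetical helper (not from the original post): loop over IDs and skip
# any ID whose fetch raises an error instead of aborting the whole loop.
# `fetch_city` stands in for the read_html() + html_text() step.
scrape_ids <- function(ids, fetch_city) {
  cities <- rep(NA_character_, length(ids))
  for (k in seq_along(ids)) {
    result <- tryCatch(
      fetch_city(ids[k]),
      error = function(e) {
        # report the failure and signal "nothing fetched" to the caller
        message("Skipping ID ", ids[k], ": ", conditionMessage(e))
        NULL
      }
    )
    # on error (or an empty result) leave NA in place and move on
    if (is.null(result) || length(result) == 0) next
    cities[k] <- result[1]
  }
  data.frame(id = ids, city = cities, stringsAsFactors = FALSE)
}

# Stand-in fetcher that fails for one ID, mimicking an HTTP 502.
fake_fetch <- function(id) {
  if (id == 300002L) stop("HTTP error 502.")
  paste("City", id)
}

res <- scrape_ids(300000:300004, fake_fetch)
```

With rvest, fetch_city would wrap the read_html() call and the h1 extraction from the loop above. Failed IDs stay NA in the result, so they can be retried later rather than silently lost.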

Recommended answer

The problem seems to be coming from scientific notation.

Related thread: "How to disable scientific notation?"
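
To illustrate the point about scientific notation: R's default formatting can render a double such as 300000 as 3e+05, which would corrupt a URL built from it. Below is a minimal sketch, assuming a hypothetical helper build_url() (not part of the original code) that formats the ID explicitly so the result does not depend on options(scipen).

```r
base_url <- "https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/"

# Hypothetical helper (not from the original post): build the page URL for an ID.
# as.integer() plus "%d" always yields plain digits, independent of scipen.
build_url <- function(id) {
  sprintf("%s%d", base_url, as.integer(id))
}

# With default options, formatting this double switches to scientific notation ...
sci <- format(3e5)
# ... while an explicit request for fixed notation keeps all digits.
fixed <- format(3e5, scientific = FALSE)

url <- build_url(3e5)
```

Note that 300000:310000 is already an integer vector, so explicit formatting like this probably matters most when the IDs are stored or computed as doubles.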

I changed your code slightly, and now it seems to work:

library(rvest)

# store IDs and city names; each ID corresponds to a single city
id_mat <- matrix(NA_character_, ncol = 2, nrow = 10001)

# download a page to a local file; return 1 on success, 0 on any error or warning
readUrl <- function(url) {
  tryCatch(
    {
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      1
    },
    error = function(cond) 0,
    warning = function(cond) 0
  )
}

# initialize index for matrix row
j <- 1

# avoid scientific notation when numbers are converted to strings
options(scipen = 999)

for (i in 300000:310000) {
  z <- as.character(i)
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  # readUrl() has already saved the page, so parse the local copy
  if (readUrl(url) == 1) {
    accu <- read_html("scrapedpage.html")
    # pull city name from the page
    citystate <- accu %>% html_nodes('h1') %>% html_text()
    # store values; keep only the first h1 in case there are several
    id_mat[j, 1] <- i
    id_mat[j, 2] <- citystate[1]
    j <- j + 1
  }
  # after every 200 IDs, close connections and wait 5 mins
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
  } else {
    # sleep for 1 or 2 seconds every loop
    Sys.sleep(sample(2, 1))
  }
}
