Web scraping using httr gives xml_nodeset error


Problem description


I am trying to build web addresses for web scraping. The loop runs from a given startDate to an endDate; this is my code:

startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")
theDay <- startDate
while (theDay <= endDate)
{ 
  dy <- as.character(theDay, format="%d")
  month <- as.character(theDay, format = "%m")
  year <- as.character(theDay, format ="%Y")
  wyoming <- "http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST&YEAR="
  address <- paste0(wyoming,year,"&MONTH=",month,"&FROM=",dy,"00&TO=",dy,"00&STNM=48657")
  print(address)
 theDay = theDay + 1
} 
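(The while-loop above can also be written with a vectorized date sequence, which makes the query construction easier to check. This is my own sketch, not part of the original question:)

```r
# Build all dates at once instead of advancing day by day.
startDate <- as.Date("01-11-17", format = "%d-%m-%y")
endDate   <- as.Date("31-01-18", format = "%d-%m-%y")
days <- seq(startDate, endDate, by = "day")

# Format one query URL per day (the server expects a "TO" parameter).
addresses <- sprintf(
  "http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%%3ALIST&YEAR=%s&MONTH=%s&FROM=%s00&TO=%s00&STNM=48657",
  format(days, "%Y"), format(days, "%m"),
  format(days, "%d"), format(days, "%d")
)
length(addresses)  # 92 days from 2017-11-01 to 2018-01-31
```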

I don't understand HTML very well, but I like how the code in https://stackoverflow.com/a/52539658/7356308 turns the data into a data frame, which is simpler to work with later. It gathers the webpage response and extracts the data under the actual column names. It works fine until I incorporate the looping task, at which point it fails with:

Error in wx_dat[[1]] : subscript out of bounds

Kindly advise on this... Thank you

library(httr)
library(rvest)
startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")

theDay <- startDate
while (theDay <= endDate)
{ 
  dy <- as.character(theDay, format="%d")
  month <- as.character(theDay, format = "%m")
  year <- as.character(theDay, format ="%Y")
  httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(
      region = "seasia",
      TYPE = "TEXT:list",
      YEAR = year,
      MONTH = month,
      FROM = paste0(dy,"00"), #is this the root of problem?
      STNM = "48657"
    )
  ) -> res

  # parse the response into an HTML document
  httr::content(res, as="parsed") %>% html_nodes("pre")-> wx_dat
  #extract data
  html_text(wx_dat[[1]]) %>%           # turn the first <pre> node into text
    strsplit("\n") %>%                 # split it into lines
    unlist() %>%                       # turn it back into a character vector
    { col_names <<- .[3]; . } %>%      # pull out the column names
    .[-(1:5)] %>%                      # strip off the header
    paste0(collapse="\n") -> readings  # turn it back into a big text blob
  readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), "\\s+"))))

  #data <- read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), "\\s+"))))
  #to write csv..
  print(theDay)
  theDay = theDay + 1
}
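The "subscript out of bounds" error most likely comes from days on which the station has no sounding: the returned page then contains no <pre> node, `html_nodes("pre")` yields an empty nodeset, and `wx_dat[[1]]` fails. A minimal guard (my own sketch, not code from the original post) checks the nodeset length before indexing:

```r
library(rvest)

# Return the text of the first <pre> node, or NULL when the page has
# none (e.g. the server's "Can't get ... Observations" error pages).
extract_pre_text <- function(html_doc) {
  pre_nodes <- html_nodes(html_doc, "pre")
  if (length(pre_nodes) == 0) return(NULL)  # no data for this day
  html_text(pre_nodes[[1]])
}
```

Inside the loop, test the result with `is.null()` and `next` to the following day instead of crashing.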

Solution

I've encapsulated the function into a non-CRAN package. You can:

devtools::install_git("https://gitlab.com/hrbrmstr/unsound.git")

then:

library(unsound)
library(magick)
library(tidyverse)

startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")

# make a sequence
days <- seq(startDate, endDate, "1 day")

# apply the sequence — note that I am not going to hit the server >80x for 
# an example and *you* should add a Sys.sleep(5) before the call to 
# get_sounding_data() to be kind to their servers.
lapply(days[1:4], function(day) {
  get_sounding_data(
    region = "seasia",
    date = day,
    from_hr = "00",
    to_hr = "00",
    station_number = "48657"
  )
}) -> soundings_48657
## Warning message:
## In get_sounding_data(region = "seasia", date = day, from_hr = "00",  :
##   Can't get 48657 WMKD Kuantan Observations at 00Z 01 Nov 2017.

rbind_soundings(soundings_48657)
## # A tibble: 176 x 14
##    pres_hpa hght_m temp_c dwpt_c relh_pct mixr_g_kg drct_deg sknt_knot
##       <dbl>  <dbl>  <dbl>  <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
##  1    1006.    16.   24.0   23.4      96.      18.4       0.        0.
##  2    1000.    70.   23.6   22.4      93.      17.4       0.        0.
##  3     993.   132.   23.2   21.5      90.      16.6      NA        NA 
##  4     981.   238.   24.6   21.6      83.      16.9      NA        NA 
##  5    1005.    16.   24.2   23.6      96.      18.6     190.        1.
##  6    1000.    62.   24.2   23.1      94.      18.2     210.        3.
##  7     991.   141.   24.0   22.9      94.      18.1     212.        6.
##  8     983.   213.   23.8   22.7      94.      18.0     213.        8.
##  9     973.   302.   23.3   22.0      92.      17.4     215.       11.
## 10     970.   329.   23.2   21.8      92.      17.3     215.       11.
## # ... with 166 more rows, and 6 more variables: thta_k <dbl>,
## #   thte_k <dbl>, thtv_k <dbl>, date <date>, from_hr <chr>, to_hr <chr>
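If you would rather not install the package, the combine step that `rbind_soundings()` performs can be sketched with dplyr: drop the NULL entries produced by failed days, then row-bind what remains. This is a hypothetical stand-in, assuming each list element is either a data frame or NULL:

```r
library(dplyr)

# Stand-in for unsound::rbind_soundings(): discard failed (NULL) days
# and stack the remaining per-day data frames into one tibble.
combine_soundings <- function(sounding_list) {
  bind_rows(Filter(Negate(is.null), sounding_list))
}
```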

I also added a function to retrieve the pre-generated maps:

get_sounding_map(
  station_number = "48657", 
  date = Sys.Date()-1, 
  map_type = "skewt", 
  map_format = "gif", 
  region = "seasia", 
  from_hr = "00", 
  to_hr = "00"
)
