Web scraping using httr gives xml_nodeset error


Problem description

I am trying to generate web addresses for web scraping. The code loops from a given startDate to endDate; this is my code:

startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")
theDay <- startDate
while (theDay <= endDate)
{ 
  dy <- as.character(theDay, format="%d")
  month <- as.character(theDay, format = "%m")
  year <- as.character(theDay, format ="%Y")
  wyoming <- "http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST&YEAR="
  address <- paste0(wyoming,year,"&MONTH=",month,"&FROM=",dy,"00&TO=",dy,"00&STNM=48657")
  print(address)
  theDay <- theDay + 1
} 
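
The same list of addresses can also be built without an explicit loop. The following is a minimal sketch in base R, assuming that `TO` is the parameter name the server expects; the variable names `days`, `base_url`, and `addresses` are illustrative and not part of the original code.

```r
# Build one address per day with a date sequence instead of a while loop.
start_date <- as.Date("01-11-17", format = "%d-%m-%y")
end_date   <- as.Date("31-01-18", format = "%d-%m-%y")

# seq() on Date objects keeps the Date class for every element.
days <- seq(start_date, end_date, by = "1 day")

base_url <- "http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST"

# paste0() is vectorised, so each piece is recycled across all days.
# The TO parameter name is an assumption about the server's query format.
addresses <- paste0(
  base_url,
  "&YEAR=",  format(days, format = "%Y"),
  "&MONTH=", format(days, format = "%m"),
  "&FROM=",  format(days, format = "%d"), "00",
  "&TO=",    format(days, format = "%d"), "00",
  "&STNM=48657"
)

print(head(addresses, 2))
```

Working from a vector of dates also makes it easy to test the rest of the code on a small subset, for example `addresses[1:3]`, before requesting the full range.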

I don't understand HTML very well, but I like how this code https://stackoverflow.com/a/52539658/7356308 turns the data into a data frame, which is simpler to work with later. It gathers the webpage response and extracts the data under the actual column names. It works fine until I incorporate the loop, at which point it fails with:

Error in wx_dat[[1]] : subscript out of bounds

Kindly advise on this. Thank you.

library(httr)
library(rvest)
startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")

theDay <- startDate
while (theDay <= endDate)
{ 
  dy <- as.character(theDay, format="%d")
  month <- as.character(theDay, format = "%m")
  year <- as.character(theDay, format ="%Y")
  httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(
      region = "seasia",
      TYPE = "TEXT:list",
      YEAR = year,
      MONTH = month,
      FROM = paste0(dy,"00"), #is this the root of problem?
      STNM = "48657"
    )
  ) -> res

  #becoming html document
  httr::content(res, as="parsed") %>% html_nodes("pre")-> wx_dat
  #extract data
  html_text(wx_dat[[1]]) %>%           # turn the first <pre> node into text
    strsplit("\n") %>%                 # split it into lines
    unlist() %>%                       # turn it back into a character vector
    { col_names <<- .[3]; . } %>%      # pull out the column names
    .[-(1:5)] %>%                      # strip off the header
    paste0(collapse="\n") -> readings  # turn it back into a big text blob
  readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names)," +"))))

  #data <- read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names)," +"))))
  #to write csv..
  print(theDay)
  theDay = theDay + 1
}

Solution

I've encapsulated the function into a non-CRAN package. You can:

devtools::install_git("https://gitlab.com/hrbrmstr/unsound.git")

then:

library(unsound)
library(magick)
library(tidyverse)

startDate <- as.Date("01-11-17", format="%d-%m-%y")
endDate <- as.Date("31-01-18",format="%d-%m-%y")

# make a sequence
days <- seq(startDate, endDate, "1 day")

# apply the sequence — note that I am not going to hit the server >80x for 
# an example and *you* should add a Sys.sleep(5) before the call to 
# get_sounding_data() to be kind to their servers.
lapply(days[1:4], function(day) {
  get_sounding_data(
    region = "seasia",
    date = day,
    from_hr = "00",
    to_hr = "00",
    station_number = "48657"
  )
}) -> soundings_48657
## Warning message:
## In get_sounding_data(region = "seasia", date = day, from_hr = "00",  :
##   Can't get 48657 WMKD Kuantan Observations at 00Z 01 Nov 2017.

rbind_soundings(soundings_48657)
## # A tibble: 176 x 14
##    pres_hpa hght_m temp_c dwpt_c relh_pct mixr_g_kg drct_deg sknt_knot
##       <dbl>  <dbl>  <dbl>  <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
##  1    1006.    16.   24.0   23.4      96.      18.4       0.        0.
##  2    1000.    70.   23.6   22.4      93.      17.4       0.        0.
##  3     993.   132.   23.2   21.5      90.      16.6      NA        NA 
##  4     981.   238.   24.6   21.6      83.      16.9      NA        NA 
##  5    1005.    16.   24.2   23.6      96.      18.6     190.        1.
##  6    1000.    62.   24.2   23.1      94.      18.2     210.        3.
##  7     991.   141.   24.0   22.9      94.      18.1     212.        6.
##  8     983.   213.   23.8   22.7      94.      18.0     213.        8.
##  9     973.   302.   23.3   22.0      92.      17.4     215.       11.
## 10     970.   329.   23.2   21.8      92.      17.3     215.       11.
## # ... with 166 more rows, and 6 more variables: thta_k <dbl>,
## #   thte_k <dbl>, thtv_k <dbl>, date <date>, from_hr <chr>, to_hr <chr>
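
The warning for 01 Nov 2017 hints at a likely cause of the original `subscript out of bounds` error: when the server has no observation for a date, the response probably contains no `<pre>` node, so `html_nodes("pre")` returns an empty node set and `wx_dat[[1]]` has nothing to index. This is an inference, not something confirmed against the server. A minimal sketch of a guard, assuming the rvest and xml2 packages are installed; `first_pre_text()` is a hypothetical helper, and the two HTML strings are made-up stand-ins for real responses:

```r
library(rvest)

# Hypothetical helper: return the text of the first <pre> node,
# or NULL when the document contains no <pre> node at all.
first_pre_text <- function(doc) {
  nodes <- html_nodes(doc, "pre")
  if (length(nodes) == 0) {
    return(NULL)  # empty node set: nodes[[1]] would be out of bounds
  }
  html_text(nodes[[1]])
}

# Made-up pages for illustration only, not real server output.
page_with_data <- xml2::read_html(
  "<html><body><pre>PRES HGHT\n1006 16</pre></body></html>"
)
page_without_data <- xml2::read_html(
  "<html><body><p>No observations for this date</p></body></html>"
)

print(first_pre_text(page_with_data))
print(is.null(first_pre_text(page_without_data)))
```

Inside a loop over dates, a `NULL` result can be used to skip that day (for example with `next`) rather than stopping the whole run.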

I also added a function to retrieve the pre-generated maps:

get_sounding_map(
  station_number = "48657", 
  date = Sys.Date()-1, 
  map_type = "skewt", 
  map_format = "gif", 
  region = "seasia", 
  from_hr = "00", 
  to_hr = "00"
)
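
For the full date range, the pause between requests recommended earlier can be wrapped in a small helper. This is a sketch in base R under stated assumptions: `fetch_with_pause()` and `fake_fetch()` are hypothetical names, and `fake_fetch()` merely simulates a day with no observation; in real use the fetching function would be a wrapper around the actual request.

```r
# Hypothetical helper: call fetch_fun once per day, pausing between calls
# and recording NULL for any day whose fetch raises an error.
fetch_with_pause <- function(days, fetch_fun, pause_secs = 5) {
  results <- vector("list", length(days))
  # Index with seq_along(): iterating directly over a Date vector
  # would drop the Date class.
  for (i in seq_along(days)) {
    if (i > 1) Sys.sleep(pause_secs)  # be kind to the server
    # Assign via list() so a NULL result keeps its slot in the list.
    results[i] <- list(tryCatch(
      fetch_fun(days[i]),
      error = function(e) NULL
    ))
  }
  names(results) <- as.character(days)
  results
}

# Stand-in fetcher for illustration only: fails on the first of the month.
fake_fetch <- function(day) {
  if (format(day, format = "%d") == "01") stop("no observation for this day")
  data.frame(date = day, pres_hpa = 1000)
}

demo_days <- seq(as.Date("2017-11-01"), as.Date("2017-11-03"), by = "1 day")
out <- fetch_with_pause(demo_days, fake_fetch, pause_secs = 0)
print(names(out))
```

Days that failed stay in the result as `NULL` entries, so they can be identified afterwards with `names(out)[vapply(out, is.null, logical(1))]`.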
