Get data frame into right format from web-scraping work

Problem description

I have code that I use to repeatedly web-scrape past air-atmosphere data by wrapping the httr calls in a function. The original code works well in a looping task; you can find it here: https://stackoverflow.com/a/52545775/7356308. I modified it a bit to scrape a different part of the website. Unfortunately, it doesn't return the right format, especially the observation time.

#' @param region one of "`naconf`", "`samer`", "`pac`", "`nz`", "`ant`", "`np`",
#'        "`europe`", "`africa`", "`seasia`", "`mideast`" (which matches the
#'        values of the drop-down menu on the site)
#' @param date an ISO character string (e.g. `YYYY-mm-dd`) or a valid `Date` object
#' @param from_hr,to_hr one of `00` (or `0`), `12` or `all`; if `all` then both
#'        values will be set to `all`
#' @param station_number the station number
#' @return data frame
#' @export
get_sounding_data <- function(region = c("naconf", "samer", "pac", "nz", "ant",
                                          "np", "europe", "africa", "seasia", "mideast"),
                              date,
                              from_hr = c("00", "12", "all"),
                              to_hr = c("00", "12", "all"),
                              station_number = 48615) {

  # attach the packages the function needs; the modified parsing step below
  # also uses stringr, purrr, tibble, dplyr and tidyr, so they are loaded here too
  suppressPackageStartupMessages({
    require("xml2", quietly = TRUE)
    require("httr", quietly = TRUE)
    require("rvest", quietly = TRUE)
    require("stringr", quietly = TRUE)
    require("purrr", quietly = TRUE)
    require("tibble", quietly = TRUE)
    require("dplyr", quietly = TRUE)
    require("tidyr", quietly = TRUE)
  })

  # validate region
  region <- match.arg(
    arg = region,
    choices = c(
      "naconf", "samer", "pac", "nz", "ant",
      "np", "europe", "africa", "seasia", "mideast"
    )
  )

  # this actually validates the date for us if it's a character string
  date <- as.Date(date)

  # get year and month
  year <- as.integer(format(date, "%Y"))
  stopifnot(year %in% 1973:as.integer(format(Sys.Date(), "%Y")))

  year <- as.character(year)
  month <- format(date, "%m")

  # we need these to translate day & *_hr to the param the app needs
  c(
    "0100", "0112", "0200", "0212", "0300", "0312", "0400", "0412",
    "0500", "0512", "0600", "0612", "0700", "0712", "0800", "0812",
    "0900", "0912", "1000", "1012", "1100", "1112", "1200", "1212",
    "1300", "1312", "1400", "1412", "1500", "1512", "1600", "1612",
    "1700", "1712", "1800", "1812", "1900", "1912", "2000", "2012",
    "2100", "2112", "2200", "2212", "2300", "2312", "2400", "2412",
    "2500", "2512", "2600", "2612", "2700", "2712", "2800", "2812",
    "2900", "2912", "3000", "3012", "3100", "3112"
  ) -> hr_vals

  c(
    "01/00Z", "01/12Z", "02/00Z", "02/12Z", "03/00Z", "03/12Z", "04/00Z",
    "04/12Z", "05/00Z", "05/12Z", "06/00Z", "06/12Z", "07/00Z", "07/12Z",
    "08/00Z", "08/12Z", "09/00Z", "09/12Z", "10/00Z", "10/12Z", "11/00Z",
    "11/12Z", "12/00Z", "12/12Z", "13/00Z", "13/12Z", "14/00Z", "14/12Z",
    "15/00Z", "15/12Z", "16/00Z", "16/12Z", "17/00Z", "17/12Z", "18/00Z",
    "18/12Z", "19/00Z", "19/12Z", "20/00Z", "20/12Z", "21/00Z", "21/12Z",
    "22/00Z", "22/12Z", "23/00Z", "23/12Z", "24/00Z", "24/12Z", "25/00Z",
    "25/12Z", "26/00Z", "26/12Z", "27/00Z", "27/12Z", "28/00Z", "28/12Z",
    "29/00Z", "29/12Z", "30/00Z", "30/12Z", "31/00Z", "31/12Z"
  ) -> hr_inputs

  hr_trans <- stats::setNames(hr_vals, hr_inputs)

  o_from_hr <- from_hr <- as.character(tolower(from_hr))
  o_to_hr <- to_hr <- as.character(tolower(to_hr))

  if ((from_hr == "all") || (to_hr == "all")) {
    from_hr <- to_hr <- "all"
  } else {
    from_hr <- hr_trans[sprintf("%s/%02dZ", format(date, "%d"), as.integer(from_hr))]
    match.arg(from_hr, hr_vals)

    to_hr <- hr_trans[sprintf("%s/%02dZ", format(date, "%d"), as.integer(to_hr))]
    match.arg(to_hr, hr_vals)
  }

  # clean up the station number if it was entered as a double
  station_number <- as.character(as.integer(station_number))

  # execute the API call
  httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(
      region = region,
      TYPE = "TEXT:LIST",
      YEAR = year,
      MONTH = sprintf("%02d", as.integer(month)),
      FROM = from_hr,
      TO = to_hr,
      STNM = station_number
    )
  ) -> res

  # check for super bad errors (that we can't handle nicely)
  httr::stop_for_status(res)

  # get the page content
  doc <- httr::content(res, as = "text")

  # if the site reports no data, issue a warning and return an empty data frame
  if (grepl("Can't get", doc)) {
    doc <- xml2::read_html(doc)
    msg <- rvest::html_nodes(doc, "body")
    msg <- rvest::html_text(msg, trim = TRUE)
    msg <- gsub("\n\n+.*$", "", msg)
    warning(msg)
    return(data.frame(stringsAsFactors = FALSE))
  }

  # turn it into something we can parse
  doc <- xml2::read_html(doc)

  # get the metadata
  meta <- rvest::html_node(doc, "h3")
  meta <- rvest::html_text(meta, trim = TRUE)

  ##################### my modification #######################
  # grab the indices block (the <pre> that follows the metadata header)
  raw_dat <- doc %>%
    html_nodes("h3+ pre") %>%
    html_text()

  # split the block into lines, squeeze the whitespace, then split each
  # "name: value" pair into its own row
  indices <- raw_dat %>%
    str_split(pattern = "\n", simplify = TRUE) %>%
    map_chr(str_squish) %>%
    tibble(x = .) %>%
    separate(x, into = c("Station", "Value"), sep = ": ") %>%
    filter(!is.na(Value))

  # one row per sounding, one column per index
  data <- tidyr::spread(indices, Station, Value)
  data
  ##############################################
}

startDate <- as.Date("01-11-17", format = "%d-%m-%y")
endDate <- as.Date("04-11-17", format = "%d-%m-%y")

days <- seq(startDate, endDate, "1 day")

lapply(days[1:4], function(day) {
  get_sounding_data(
    region = "seasia",
    date = day,
    from_hr = "00",
    to_hr = "00",
    station_number = "48615"
  )
}) -> soundings_48615

# If a station had no data for a particular day, there will be warnings about it, so you can check how many days are missing because no data were present:

warnings()
## Warning message:
## In get_sounding_data(region = "seasia", date = day, from_hr = "00",  :
##   Can't get 48615 WMKD Kota Bharu Observations at 00Z 01 Nov 2017.
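
This check isn't in the original post, but here is one quick way to count the days that came back empty, assuming (as above) that a day with no data is returned as a zero-row, zero-column data frame:

# count how many list elements are the empty data frames returned
# for days where the site had no data
sum(vapply(soundings_48615, function(x) nrow(x) == 0, logical(1)))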

str(soundings_48615, 2)
List of 4
 $ :'data.frame':   0 obs. of  0 variables
 $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  30 variables:
  ..$ 1000 hPa to 500 hPa thickness              : chr "5782.00"
  ..$ Bulk Richardson Number                     : chr "240.00"
  ..$ Bulk Richardson Number using CAPV          : chr "349.48"
  ..$ CAPE using virtual temperature             : chr "595.76"
  ..$ CINS using virtual temperature             : chr "-8.60"
  ..$ Convective Available Potential Energy      : chr "409.13"
  ..$ Convective Inhibition                      : chr "-26.90"
  ..$ Cross totals index                         : chr "19.00"
  ..$ Equilibrum Level                           : chr "228.72"
  ..$ Equilibrum Level using virtual temperature : chr "226.79"
  ..$ K index                                    : chr "14.40"
  ..$ Level of Free Convection                   : chr "819.49"
  ..$ LFCT using virtual temperature             : chr "871.25"
  ..$ LIFT computed using virtual temperature    : chr "-3.38"
  ..$ Lifted index                               : chr "-2.86"
  ..$ Mean mixed layer mixing ratio              : chr "17.45"
  ..$ Mean mixed layer potential temperature     : chr "299.97"
  ..$ Observation time                           : chr "190120/1200"
  ..$ Precipitable water [mm] for entire sounding: chr "46.56"
  ..$ Pres [hPa] of the Lifted Condensation Level: chr "938.33"
  ..$ Showalter index                            : chr "1.26"
  ..$ Station elevation                          : chr "5.0"
  ..$ Station identifier                         : chr "WMKC"
  ..$ Station latitude                           : chr "6.16"
  ..$ Station longitude                          : chr "102.28"
  ..$ Station number                             : chr "48615"
  ..$ SWEAT index                                : chr "187.99"
  ..$ Temp [K] of the Lifted Condensation Level  : chr "294.55"
  ..$ Totals totals index                        : chr "42.90"
  ..$ Vertical totals index                      : chr "23.90"
 $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  30 variables:
  ..$ (same 30 variables and values as the element above)
 $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  30 variables:
  ..$ (same 30 variables and values as the element above)

Recommended answer

You could try using parse_guess on soundings_48615; it will convert the columns to more suitable types:

library(tidyverse)
library(readr)

new_df <- map(soundings_48615, . %>% mutate_all(parse_guess))


str(new_df)
#List of 4
# $ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  1 obs. of  30 variables:
#  ..$ 1000 hPa to 500 hPa thickness              : num 5778
#  ..$ Bulk Richardson Number                     : num 2094
#  ..$ Bulk Richardson Number using CAPV          : num 2472
#  ..$ CAPE using virtual temperature             : num 921
#  ..$ CINS using virtual temperature             : num -9.03
#  ..$ Convective Available Potential Energy      : num 780
#  ..$ Convective Inhibition                      : num -14.2
#  ..$ Cross totals index                         : num 21.7
#  ..$ Equilibrum Level                           : num 136
#....
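
parse_guess will most likely leave the Observation time column as character, since values like "190120/1200" don't match any format readr guesses. As a follow-up sketch (not part of the answer above, and assuming the column keeps the "YYMMDD/HHMM" pattern shown in the str() output), it can be converted to a proper date-time with base R:

# convert "YYMMDD/HHMM" strings (e.g. "190120/1200" -> 2019-01-20 12:00 UTC)
new_df <- lapply(new_df, function(df) {
  if ("Observation time" %in% names(df)) {   # skip the empty data frames
    df[["Observation time"]] <- as.POSIXct(df[["Observation time"]],
                                           format = "%y%m%d/%H%M", tz = "UTC")
  }
  df
})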
