Dropped rows using readHTMLTable in R

Problem Description

I am attempting to extract model data from NOAA using readHTMLTable. The table I am trying to get has multiple subtitles, where each subtitle consists of a single cell spanning all columns, as far as I can tell from the HTML. For some reason, this is causing readHTMLTable to omit the row immediately following the subtitle. Here's code that will reproduce the issue:

library(XML)

url <- "http://nomads.ncep.noaa.gov/"
ncep.tables <- readHTMLTable(url, header = TRUE)

# Find the list of real-time models
for (ncep.table in ncep.tables) {
  if ("grib filter" %in% names(ncep.table) && "gds-alt" %in% names(ncep.table)) {
    rt.tbl <- ncep.table
  }
}

# Here's where the problem is:
cat(paste(rt.tbl[["Data Set"]][15:20], collapse = "\n"))

# On the website, there is a model called "AQM Daily Maximum"
# between Regional Models and AQM Hourly Surface Ozone,
# but it's missing now...
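To see the subtitle rows I'm describing, here's a rough way to list them from the raw HTML. This is only a sketch: the XPath assumes each subtitle is a row whose only td carries a colspan attribute, which is what the page source suggests, but I haven't verified it holds for every table on the page.

# Sketch: list table rows made of a single colspan-ed cell
# (the presumed subtitle rows). Assumes the structure described
# above; not verified against the live page.
doc <- htmlParse(url)
subtitles <- getNodeSet(doc, "//table//tr[count(td) = 1]/td[@colspan]")
sapply(subtitles, xmlValue)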

So, if you go to http://nomads.ncep.noaa.gov/ and look at the central table (the one with "Data Set" in the top right cell), you'll see a subtitle called "Regional Models." The AQM Daily Maximum model immediately below the subtitle is skipped during the extraction in the code above.

I maintain the rNOMADS package in R, so if I can get this to work it will save me loads of time maintaining the package as well as keep it accurate and up to date for my users. Thank you for your help!

Recommended Answer

By golly, I think I got it. You won't be able to use readHTMLTable (and I now know the XML package code way better than I did before… there's some serious R-fu in that code). I'm using rvest simply because I mix XPath and CSS selectors (though I ended up thinking more in XPath). dplyr is only there for glimpse.

library(XML)
library(dplyr)
library(rvest)

# strip leading/trailing whitespace
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

# neither rvest::html nor rvest::html_session liked it, hence using XML::htmlParse
doc <- htmlParse("http://nomads.ncep.noaa.gov/")

# "Data Set" cells: the third td to the left of each cell containing "http"
ds <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                            descendant::td[contains(., 'http')]/
                                            preceding-sibling::td[3]")

data_set <- ds %>% html_text() %>% trim()
data_set_descr_link <- ds %>% html_nodes("a") %>% html_attr("href")

# "Freq" cells, matched by the update-frequency text they contain
freq <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                           descendant::td[contains(., 'hourly') or
                                          contains(., 'hours') or
                                          contains(., 'daily') or
                                          contains(., '06Z')]") %>%
  html_text() %>% trim()

# "grib filter" cells: the td immediately left of the "http" cell;
# NA when the cell carries no link
grib_filter <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                                  descendant::td[contains(., 'http')]/preceding-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })

# the "http" column itself
http_link <- doc %>% html_nodes("a[href^='/pub/data/']") %>% html_attr("href")

# "gds-alt" cells: the td immediately right of the "http" cell
gds_alt <- doc %>% html_nodes(xpath="//table/descendant::th[@class='nomads'][1]/../../
                              descendant::td[contains(., 'http')]/following-sibling::td[1]") %>%
  sapply(function(x) {
    ifelse(x %>% xpathApply("boolean(./a)"),
           x %>% html_node("a") %>% html_attr("href"),
           NA)
  })

nom <- data.frame(data_set,
                  data_set_descr_link,
                  freq,
                  grib_filter,
                  gds_alt)

glimpse(nom)

## Variables:
## $ data_set            (fctr) FNL, GFS 1.0x1.0 Degree, GFS 0.5x0.5 Degr...
## $ data_set_descr_link (fctr) txt_descriptions/fnl_doc.shtml, txt_descr...
## $ freq                (fctr) 6 hours, 6 hours, 6 hours, 12 hours, 6 ho...
## $ grib_filter         (fctr) cgi-bin/filter_fnl.pl, cgi-bin/filter_gfs...
## $ gds_alt             (fctr) dods-alt/fnl, dods-alt/gfs, dods-alt/gfs_...

head(nom)

##                             data_set
## 1                                FNL
## 2                 GFS 1.0x1.0 Degree
## 3                 GFS 0.5x0.5 Degree
## 4                 GFS 2.5x2.5 Degree
## 5       GFS Ensemble high resolution
## 6 GFS Ensemble Precip Bias-Corrected
##
##                                             data_set_descr_link     freq
## 1                                txt_descriptions/fnl_doc.shtml  6 hours
## 2                txt_descriptions/GFS_high_resolution_doc.shtml  6 hours
## 3                    txt_descriptions/GFS_half_degree_doc.shtml  6 hours
## 4                 txt_descriptions/GFS_Low_Resolution_doc.shtml 12 hours
## 5       txt_descriptions/GFS_Ensemble_high_resolution_doc.shtml  6 hours
## 6 txt_descriptions/GFS_Ensemble_precip_bias_corrected_doc.shtml    daily
##
##                       grib_filter          gds_alt
## 1           cgi-bin/filter_fnl.pl     dods-alt/fnl
## 2           cgi-bin/filter_gfs.pl     dods-alt/gfs
## 3        cgi-bin/filter_gfs_hd.pl  dods-alt/gfs_hd
## 4       cgi-bin/filter_gfs_2p5.pl dods-alt/gfs_2p5
## 5          cgi-bin/filter_gens.pl    dods-alt/gens
## 6 cgi-bin/filter_gensbc_precip.pl dods-alt/gens_bc

Please make sure the columns match. I eyeballed it, but a verification would be awesome. NOTE: there may be a better way to do the sapply calls (anyone should feel free to edit that in, too, crediting yourself).
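For what it's worth, here's one sketch of an alternative: vapply pins the return type to character and swaps the vectorized ifelse for a plain if/else. It reuses the same boolean-XPath test as above; the link_or_na helper is my own name and hasn't been run against the live page.

# Hypothetical helper: pull the href out of a cell, or NA when the
# cell has no <a>. Same boolean XPath test as the sapply version.
link_or_na <- function(cells) {
  vapply(cells, function(x) {
    if (xpathApply(x, "boolean(./a)"))
      x %>% html_node("a") %>% html_attr("href")
    else
      NA_character_
  }, character(1))
}

# usage would then be, e.g.:
# grib_filter <- doc %>%
#   html_nodes(xpath="...same XPath as above...") %>%
#   link_or_na()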

It's really fragile code, i.e. if the format changes, it'll croak (but that's kinda true for all scraping). It should withstand them actually creating valid HTML (this is wretched HTML, btw), but most of the code relies on the http column remaining valid, since most of the other column extractions key off it. Your missing model is there as well. If any of the XPath is confusing, drop a comment and I'll try to explain.
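And since everything hinges on those columns staying aligned, a cheap guard before the data.frame() call would at least fail loudly instead of letting data.frame() recycle quietly. A minimal sketch (it only checks lengths, nothing smarter):

# Sketch: abort if the extracted vectors drift out of alignment,
# rather than letting data.frame() recycle or error confusingly.
stopifnot(
  length(data_set_descr_link) == length(data_set),
  length(freq)                == length(data_set),
  length(grib_filter)         == length(data_set),
  length(gds_alt)             == length(data_set)
)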
