Read HTML Table Into Data Frame with Hyperlinks in R


Problem description


I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.

The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.

The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.

Here's what I have so far:

library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )

However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?

Solution

This code lets you target either the XML files or the CSV files. You get each filename along with its URL, so you can iterate over both and save the downloads under names you'll recognize later on.

library(rvest)
library(dplyr)

pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")

# Select every table row (<tr>) whose filename cell mentions 'csv'
csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
tibble(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df

glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...
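
To see what the XPath is doing without hitting the live site, here is a minimal sketch on a tiny inline table. The markup is hypothetical: only the `labelOptional_ind` class and the "link in the fourth cell" layout are taken from the selectors above, and the filenames and hrefs are made up. The trailing `/..` is what steps from the matched filename cell up to its parent row, so the filename and the link can both be read from the same row.

```r
library(rvest)

# Hypothetical markup: only the class name and the four-cell row layout
# are modeled on the selectors used above; the values are made up.
demo_html <- paste0(
  "<table>",
  "<tr><td class='labelOptional_ind'>report_a_csv.zip</td><td>x</td><td>y</td>",
  "<td><div><a href='/download?id=1'>zip</a></div></td></tr>",
  "<tr><td class='labelOptional_ind'>report_a_xml.zip</td><td>x</td><td>y</td>",
  "<td><div><a href='/download?id=2'>zip</a></div></td></tr>",
  "</table>"
)
demo_pg <- read_html(demo_html)

# Match the filename cell that mentions 'csv', then step up to its <tr> with "/.."
demo_rows <- html_nodes(
  demo_pg,
  xpath = ".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/.."
)

# Within each matched row, read the filename text and the link target
demo_names <- html_text(html_nodes(demo_rows, "td.labelOptional_ind"))
demo_hrefs <- html_attr(html_nodes(demo_rows, xpath = ".//td[4]/div/a"), "href")
```

Only the row whose filename contains `csv` is kept, which is why the same pattern with `'xml'` yields the other set of files.
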

# Same pattern, this time keeping the rows whose filename cell mentions 'xml'
xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")

tibble(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df

glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url      <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
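
If you want what the question literally asks for (the whole table, with "zip" in the last column replaced by the link), something like the following might work. This is a sketch on a made-up inline table, not the real page: the header names and cell values are hypothetical, and the base URL assumes the relative hrefs should be resolved against the page address from the question.

```r
library(rvest)
library(xml2)

# Hypothetical table: header names and cell values are made up for illustration
demo_html <- paste0(
  "<table>",
  "<tr><th>Name</th><th>Posted</th><th>ZIP</th></tr>",
  "<tr><td>report_a_csv.zip</td><td>2017-07-29</td>",
  "<td><a href='/download?id=1'>zip</a></td></tr>",
  "<tr><td>report_a_xml.zip</td><td>2017-07-29</td>",
  "<td><a href='/download?id=2'>zip</a></td></tr>",
  "</table>"
)
demo_tbl <- html_node(read_html(demo_html), "table")

# html_table() keeps only the visible text, so the last column reads "zip"
demo_df <- html_table(demo_tbl)

# Take one link per data row (rows that contain <td> cells, skipping the header)
data_rows <- html_nodes(demo_tbl, xpath = ".//tr[td]")
hrefs <- html_attr(html_node(data_rows, "a"), "href")

# Assumption: relative hrefs resolve against the address of the page they came from
base_url <- "http://mis.ercot.com/misapp/GetReports.do"
demo_df[[ncol(demo_df)]] <- url_absolute(hrefs, base_url)
```

With absolute URLs in place, each one could then be passed to `download.file()` (with `mode = "wb"` for zip files), using the filename column as the destination name. Whether the server accepts those requests as-is is something you would need to check against the live site.
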
