Scraping an HTML table and its href links in R
Problem description
I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The code above gives a nice data frame. However, the last column only contains the text "Pass", whereas I would like it to hold the link associated with that text. I tried the following to add the links, but they do not line up with the correct rows:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.
Thanks!
Recommended answer
You're looking for <a> elements inside table cells (<td>). Then you want the value of the href attribute. So here's one way, which will return a vector with all the URLs for the PDF downloads:
qi_webpage %>%
html_nodes(xpath = "//td/a") %>%
html_attr("href")
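The vector above only contains the cells that actually have a link, so it can be shorter than the table and will not line up with the rows by position. One possible way to keep the links aligned is to select the table rows first and then look up the link inside each row, so that rows without a link yield NA. The following is only a sketch: the inline HTML is a made-up stand-in for the real page, and it assumes the link sits in the last cell of each row.

```r
library(rvest)

# Hypothetical stand-in for the real page: two data rows, only the first has a link
html <- '<table>
  <tr><th>Product</th><th>Study Protocol</th></tr>
  <tr><td>A</td><td><a href="/files/a.pdf">Pass</a></td></tr>
  <tr><td>B</td><td></td></tr>
</table>'
page <- read_html(html)

# Parse the first table into a data frame, as in the question
tbl <- html_table(html_nodes(page, "table"), header = TRUE)[[1]]

# Select only the data rows (rows containing td cells) of the first table
rows <- html_nodes(page, xpath = "//table[1]//tr[td]")

# html_node() returns exactly one result per row (missing if there is no match),
# so html_attr() gives one href or NA per row, in row order
tbl$MSTLink <- html_attr(html_node(rows, xpath = "./td[last()]/a"), "href")

print(tbl)
```

On the real page, qi_webpage would take the place of page, and the XPath may need adjusting if the link is not in the last column. If the href values turn out to be relative, xml2::url_absolute() can resolve them against the page URL.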