Scraping an HTML table and its href links in R

Question

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.

library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)

link <- "http://www.qimedical.com/resources/method-suitability/"

qi_webpage <- read_html(link)

qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]

The code above gives a nice data frame. However, the last column only contains the text "Pass", whereas I would like to have the link associated with it. I have tried the following to add the links, but they do not line up with the correct rows:

qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))

qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]

qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))

I know little about html, css, etc, so I am not sure what I am missing to accomplish this properly.

Thanks!

Recommended answer

You're looking for <a> elements inside table cells (<td>), and you want the value of each one's href attribute. Here's one way, which returns a character vector with all the URLs for the PDF downloads:

qi_webpage %>%
  html_nodes(xpath = "//td/a") %>% 
  html_attr("href")
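
Note that this vector has one entry per link, so it can be shorter than the table if some rows have no link. That is likely why the URLs in the question ended up on the wrong rows. One way to keep them aligned is to go row by row, sketched here on a small made-up table (the HTML, column names, and file paths are assumptions for illustration, not the real page). It relies on html_node() (singular) returning exactly one result per row, with html_attr() giving NA where a row has no link:

```r
library(rvest)

# Hypothetical stand-in for the real page: the last cell of a row
# may or may not contain a link.
html <- '<table>
  <tr><th>Product</th><th>Study Protocol</th></tr>
  <tr><td>A</td><td><a href="/docs/a.pdf">Pass</a></td></tr>
  <tr><td>B</td><td>Pending</td></tr>
  <tr><td>C</td><td><a href="/docs/c.pdf">Pass</a></td></tr>
</table>'

page <- read_html(html)
tbl_node <- html_node(page, "table")

# The table text, as in the question
tbl <- html_table(tbl_node, header = TRUE)

# Data rows only: <tr> elements that contain at least one <td>
rows <- html_nodes(tbl_node, xpath = ".//tr[td]")

# One result per row: the href of the link in the last cell, or NA if there is none
tbl$MSTLink <- rows %>%
  html_node(xpath = "./td[last()]/a") %>%
  html_attr("href")

print(tbl)
```

If the hrefs on the real page are relative, xml2::url_absolute() can turn them into full URLs given the page address.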
