Scrape multiple linked HTML tables in R and rvest

Published: 2021/7/14 18:35:35 · Tags: r, web-scraping, rvest


Problem description

This article, http://www.ajnr.org/content/30/7/1402.full, contains four links to HTML tables that I would like to scrape with rvest.

With the help of the CSS selector:

"#T1 a"

it is possible to reach the first table like this:

library("rvest")
# Note: in rvest >= 1.0 these are named session() and session_follow_link()
html_session("http://www.ajnr.org/content/30/7/1402.full") %>%
  follow_link(css = "#T1 a") %>%
  html_table() %>%
  View()

The CSS selector:

".table-inline li:nth-child(1) a"

makes it possible to select all four HTML nodes containing the tags that link to the four tables:

library("rvest")
# read_html() replaces the older html() function
read_html("http://www.ajnr.org/content/30/7/1402.full") %>%
  html_nodes(css = ".table-inline li:nth-child(1) a")

How can I loop through this list and retrieve all four tables in one go? What is the best approach?

Recommended answer

Here is one approach:

library(rvest)

url <- "http://www.ajnr.org/content/30/7/1402.full"
page <- read_html(url)

# First find all the urls
table_urls <- page %>% 
  html_nodes(".table-inline li:nth-child(1) a") %>%
  html_attr("href") %>%
  xml2::url_absolute(url)

# Then loop over the urls, downloading & extracting the table
lapply(table_urls, . %>% read_html() %>% html_table())
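The two steps above (collect the link targets, then read each one) can also be wrapped in small helpers and tried offline before hitting the real site. The following is only a sketch under assumptions: the function names `get_table_urls` and `read_tables`, the inline demo markup, and the `T1.expansion.html` / `T2.expansion.html` paths are all hypothetical and not taken from the article page. It relies on `read_html()` accepting either an HTML string or a file path, so no network access is needed for the demonstration.

```r
library(rvest)

# Hypothetical helper: collect the hrefs matched by `css` and resolve them
# against `base_url`, dropping duplicates.
get_table_urls <- function(page, base_url,
                           css = ".table-inline li:nth-child(1) a") {
  hrefs <- html_attr(html_nodes(page, css), "href")
  unique(xml2::url_absolute(hrefs, base_url))
}

# Hypothetical helper: read every location (URL or file path) and return a
# list named by location; each element is the list of tables found there.
read_tables <- function(locations) {
  out <- lapply(locations, function(loc) html_table(read_html(loc)))
  names(out) <- locations
  out
}

# Made-up markup shaped so that the selector from the question matches the
# first link of each list. The href values are assumptions.
demo_page <- read_html(paste0(
  '<div class="table-inline"><ul>',
  '<li><a href="/content/30/7/1402/T1.expansion.html">View table</a></li>',
  '<li><a href="/content/30/7/1402/T1.expansion.html?popup=1">New window</a></li>',
  '</ul></div>',
  '<div class="table-inline"><ul>',
  '<li><a href="/content/30/7/1402/T2.expansion.html">View table</a></li>',
  '</ul></div>'
))

demo_urls <- get_table_urls(
  demo_page,
  "http://www.ajnr.org/content/30/7/1402.full"
)

# A local file stands in for a linked table page.
tmp <- tempfile(fileext = ".html")
writeLines(
  "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>",
  tmp
)
demo_tables <- read_tables(tmp)
```

Against the live site, the same helpers would be called as `read_tables(get_table_urls(read_html(url), url))`. Naming the result by location makes it easier to tell which table came from which link, which the bare `lapply()` call does not do.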
