Scraping table within iframe using R rvest library

Question

I am decent with R's rvest library for scraping websites, but am struggling with something new. From this webpage - http://www.naia.org/ViewArticle.dbml?ATCLID=205323044 - I am trying to scrape the main table of colleges.

Here is what my code looks like currently:

library(rvest)

NAIA_url = "http://www.naia.org/ViewArticle.dbml?ATCLID=205323044"
NAIA_page = read_html(NAIA_url)

tables = html_table(html_nodes(NAIA_page, 'table'))
# tables is a length-2 list, but neither element is the table I want.

# grab the third iframe node, which is the one that holds the table
iframe = html_nodes(NAIA_page, "iframe")[3]

However, I'm stuck past this point. (1) For some reason, calling html_nodes on the page isn't grabbing the table I want, and (2) I'm not sure whether I should instead grab the iframe and then try to get the table from within it.

Any help is appreciated!

Recommended answer

If the embedded iframe is html, you can grab the iframe source and get your desired table from there.
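
Hard-coding the index 3 only works while the page keeps its iframes in the same order, so it can help to list every iframe's src first and pick the right one by inspection. Here is a minimal sketch of that step on a small inline HTML snippet. The page URL and both src values are made up for illustration, and url_absolute() from xml2 is used on the assumption that some src attributes may be relative paths rather than full URLs.

```r
library(rvest)
library(xml2)

# Hypothetical page URL and markup, for illustration only.
page_url <- "http://example.com/articles/view.html"
page <- read_html('<html><body>
  <iframe src="/ads/banner.html"></iframe>
  <iframe src="tables/colleges.html"></iframe>
</body></html>')

# Collect the src attribute of every iframe on the page.
iframe_src <- page %>%
  html_nodes("iframe") %>%
  html_attr("src")

# Resolve relative paths against the page URL so read_html() can fetch them.
iframe_urls <- url_absolute(iframe_src, page_url)
print(iframe_urls)
```

Once the right entry is identified, its resolved URL can be passed to read_html() in the same way as the src value in the pipeline that follows.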


library(rvest)
#> Loading required package: xml2
library(magrittr)
"http://www.naia.org/ViewArticle.dbml?ATCLID=205323044" %>%
  read_html() %>%
  html_nodes("iframe") %>%
  extract(3) %>% 
  html_attr("src") %>% 
  read_html() %>% 
  html_node("#searchResultsTable") %>% 
  html_table() %>%
  head()
#>                                   College or University       City, State
#> 1                   Central Christian College ATHLETICS     McPherson, KS
#> 2 +                   Crowley's Ridge College ATHLETICS     Paragould, AR
#> 3                       Edward Waters College ATHLETICS  Jacksonville, Fl
#> 4                 Fisher College ADMISSIONS | ATHLETICS        Boston, MA
#> 5       Georgia Gwinnett College ADMISSIONS | ATHLETICS Lawrenceville, GA
#> 6   Lincoln Christian University ADMISSIONS | ATHLETICS       Lincoln, IL
#>   Conference Enrollment
#> 1     A.I.I.        259
#> 2     A.I.I.          0
#> 3     A.I.I.        805
#> 4     A.I.I.        600
#> 5     A.I.I.      9,720
#> 6     A.I.I.      1,060
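
The Enrollment values in the output above contain thousands separators (for example 9,720), so html_table() will most likely leave that column as text. Below is a small sketch of one way to convert it to numbers. The inline table is a hypothetical stand-in for the iframe document, with two rows copied from the output above, so the example runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the iframe document; the two data rows
# are copied from the output shown above.
doc <- read_html('<table id="searchResultsTable">
  <tr><th>College or University</th><th>Enrollment</th></tr>
  <tr><td>Georgia Gwinnett College</td><td>9,720</td></tr>
  <tr><td>Fisher College</td><td>600</td></tr>
</table>')

colleges <- doc %>%
  html_node("#searchResultsTable") %>%
  html_table()

# Strip the thousands separators, then convert the column to numeric.
colleges$Enrollment <- as.numeric(gsub(",", "", colleges$Enrollment))
print(colleges)
```

The same gsub() and as.numeric() step should apply unchanged to the table returned by the pipeline above, assuming the column is still named Enrollment.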
