Scraping an interactive table in R with rvest

Question

I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm

I'm using rvest but am having trouble finding the correct XPath for the table. My current code is as follows:

library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)  # download and parse the page source

table <- gis_data_html %>%
  html_node(xpath = '//span') %>%
  html_table()

Currently I get the following error: no applicable method for 'html_table' applied to an object of class "xml_missing"

Does anyone know what I would need to change to scrape the interactive table in the link?

Recommended answer

The problem you're facing is that rvest reads the source of a page, but it doesn't execute the JavaScript on the page. When I inspect the interactive table, I see:

<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0" 
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)" 
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)" 
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>

But when I look at the page source, "aw52-box-focus" doesn't exist. This is because the element is created by JavaScript as the page loads.
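To make the failure mode concrete, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for a static page source whose grid is only built by a script; it is not the real page. It shows that looking up an element absent from the source yields an "xml_missing" object, which is what html_table() then refuses to handle.

```r
library(rvest)

# Hypothetical stand-in for a static page source: a placeholder div and a
# script, but no table markup and no element with the inspected id.
static_source <- '<html><body><div id="grid"></div>
  <script>/* the grid would be built here at load time */</script>
  </body></html>'

page <- read_html(static_source)

# Look up the id seen in the browser inspector; it is not in the source.
node <- html_node(page, xpath = '//*[@id="aw52-box-focus"]')
is_missing <- inherits(node, "xml_missing")

# Check for real <table> elements before calling html_table().
tables <- html_nodes(page, "table")

if (is_missing || length(tables) == 0) {
  message("Nothing to parse in the static source; the content is likely script-generated.")
}
```

Checking for a missing node before piping into html_table() turns the cryptic method-dispatch error into an explicit signal that the data has to come from somewhere else.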

You have a couple of options to deal with this. The 'easy' one is to use RSelenium and use an actual browser to load the page and then get the element after it's loaded. The other option is to read through the JavaScript and see where it's getting the data from and then tap into that rather than scraping the table.
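The RSelenium route could look roughly like the sketch below. It assumes the RSelenium package, a local Firefox install, and a matching driver are available. The helper name fetch_rendered_html, the five-second wait, and the choice of browser are illustrative assumptions, not part of the original answer. The id in the usage part is the one seen in the inspector and may differ between page loads.

```r
# Sketch only: needs RSelenium, xml2, rvest, and a working browser driver.
fetch_rendered_html <- function(url, browser = "firefox", wait_seconds = 5) {
  # Start a Selenium server plus a browser session.
  driver <- RSelenium::rsDriver(browser = browser, verbose = FALSE)
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  driver$client$navigate(url)
  Sys.sleep(wait_seconds)  # crude wait so the page scripts can build the grid

  # getPageSource() returns a list; its first element is the rendered HTML.
  rendered <- driver$client$getPageSource()[[1]]
  xml2::read_html(rendered)
}

if (interactive()) {
  page <- fetch_rendered_html("http://proximityone.com/cd114_2013_2014.htm")
  # Element id as seen in the browser inspector (assumed to be stable).
  node <- rvest::html_node(page, xpath = '//*[@id="aw52-box-focus"]')
  print(node)
}
```

Because the grid is drawn by a script, the rendered markup may not contain a plain table element, so html_table() is not guaranteed to work on it even after rendering.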

Update

Turns out it's really easy to read the JavaScript: it's just loading a CSV file. The address is in plain text: http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv
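Reading that file directly could be sketched as follows. The helper name read_headerless_csv and its col_names argument are hypothetical. header = FALSE is used because the file has no header row, and no column names are supplied here since they are not listed in this text.

```r
csv_url <- "http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv"

# Read a CSV that has no header row; optionally attach column names afterwards.
read_headerless_csv <- function(path, col_names = NULL) {
  df <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
  if (!is.null(col_names)) {
    # Fail early if the supplied names do not match the number of columns.
    stopifnot(length(col_names) == ncol(df))
    names(df) <- col_names
  }
  df
}

if (interactive()) {
  cd_data <- read_headerless_csv(csv_url)  # columns default to V1, V2, ...
  str(cd_data)
}
```

Once the real column names have been collected, they can be passed as a character vector through col_names.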

The .csv doesn't have column headers, but those are also in ...
