Missing elements when using `read_html` using `rvest` in R

Problem description

I'm trying to use the read_html function in the rvest package, but have come across a problem I am struggling with.

For example, if I were trying to read in the bottom table that appears on this page, I would use the following code:

library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")

By inspecting the HTML code in the browser, I can see that the content I would like is contained in a <table> tag (specifically, it is all contained within <table class="t-calc">). But when I try to extract this using:

tables <- html_nodes(html_content, xpath = '//table')

I retrieve the following:

> tables
{xml_nodeset (4)}
[1] <table class="tippingpointroi unexpanded">\n  <tbody>\n    <tr data-state="FL" class=" "> ...
[2] <table class="tippingpointroi unexpanded">\n  <tbody>\n    <tr data-state="NV" class=" "> ...
[3] <table class="scenarios">\n  <tbody/>\n  <tr data-id="1">\n    <td class="description">El ...
[4] <table class="t-desktop t-polls">\n  <thead>\n    <tr class="th-row">\n      <th class="t ...

This includes some of the table elements on the page, but not the one I am interested in.

Any suggestions on where I am going wrong would be most appreciated!

Recommended answer

The table is built dynamically from data in JavaScript variables on the page itself, so it is not present in the HTML that read_html downloads. Either use RSelenium to grab the text of the page after it's rendered and pass that into rvest, or grab a treasure trove of all the data by using V8:

library(rvest)
library(V8)

URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"

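pg <- read_html(URL)

# Pull the text of the inline <script> that defines the race data
js <- html_nodes(pg, xpath=".//script[contains(., 'race.model')]") %>%  html_text()

# Run that script in an embedded JavaScript engine
ctx <- v8()
ctx$eval(JS(js))

# Copy the resulting JavaScript object back into R as a nested list
race <- ctx$get("race", simplifyVector=FALSE)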

str(race) ## output too large to paste here
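
The same mechanism can be checked offline on a tiny stand-in page. The sketch below is only an illustration: the HTML string, the `holder` id, and the numbers inside `race` are all made up and are not taken from the real site. It shows that a table created by JavaScript is absent from what read_html parses, while the data behind it can still be recovered from the script text.

```r
library(rvest)
library(V8)

# Hypothetical miniature page (not the real site): no <table> in the markup,
# only an inline script holding the data a browser would render from.
html <- '<html><body>
<div id="holder"></div>
<script>var race = {"model": [{"party": "D", "winprob": 98.1}, {"party": "R", "winprob": 1.9}]};</script>
</body></html>'

pg <- read_html(html)

# The static markup has no <table>, so this node set is empty.
n_tables <- length(html_nodes(pg, xpath = "//table"))

# The data is still present as text inside the <script> element.
js <- html_text(html_nodes(pg, xpath = ".//script[contains(., 'race')]"))

# Evaluate the script and copy the variable back into R as a nested list.
ctx <- v8()
ctx$eval(js)
race <- ctx$get("race", simplifyVector = FALSE)
```

Here `n_tables` is zero even though `race$model` holds both rows, which mirrors the situation in the question.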

If they ever change the formatting of the JavaScript (it's an automated process, so it's unlikely, but you never know), then the RSelenium approach will be better, provided they don't change the format of the table structure (again, unlikely, but you never know).
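
For the RSelenium route, a rough sketch might look like the following. Several things here are assumptions rather than part of the answer: the helper names `fetch_rendered_source` and `extract_calc_tables` are hypothetical, a local Firefox with a matching driver is assumed to be available, and the fixed five-second wait is a guess at how long the page needs to build the table.

```r
library(rvest)

# Hypothetical helper: return the page source after the browser has run the
# scripts. Assumes the RSelenium package plus a local Firefox and driver.
fetch_rendered_source <- function(url, wait_seconds = 5) {
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  # Always shut the browser and the server down, even on error.
  on.exit({
    rD$client$close()
    rD$server$stop()
  }, add = TRUE)
  rD$client$navigate(url)
  Sys.sleep(wait_seconds)  # crude fixed wait; the right delay is a guess
  rD$client$getPageSource()[[1]]
}

# Hypothetical helper: parse rendered HTML and return every
# <table class="t-calc"> as a data frame.
extract_calc_tables <- function(html_source) {
  pg <- read_html(html_source)
  html_table(html_nodes(pg, "table.t-calc"))
}

# Usage (needs a working browser setup, so it is left commented out):
# src <- fetch_rendered_source(
#   "https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
# extract_calc_tables(src)
```

Keeping the browser step separate from the parsing step means the parsing can be tried on a saved copy of the rendered page without starting Selenium.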
