Missing elements when using `read_html` from `rvest` in R
Problem description
I'm trying to use the `read_html` function in the `rvest` package, but have come across a problem I am struggling with.
For example, to read in the bottom table that appears on this page, I would use the following code:
library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
By inspecting the HTML code in the browser, I can see that the content I would like is contained in a `<table>` tag (specifically, it is all contained within `<table class="t-calc">`). But when I try to extract this using:
tables <- html_nodes(html_content, xpath = '//table')
I get the following:
> tables
{xml_nodeset (4)}
[1] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="FL" class=" "> ...
[2] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="NV" class=" "> ...
[3] <table class="scenarios">\n <tbody/>\n <tr data-id="1">\n <td class="description">El ...
[4] <table class="t-desktop t-polls">\n <thead>\n <tr class="th-row">\n <th class="t ...
This includes some of the table elements on the page, but not the one I am interested in.
Any suggestions on where I am going wrong would be most appreciated!
Recommended answer
The table is built dynamically from data in JavaScript variables on the page itself, so it is not present in the HTML source that `read_html` parses. Either use `RSelenium` to grab the text of the page after it's rendered and pass that into `rvest`, or grab a treasure trove of all the data by using `V8`:
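library(rvest)
library(V8)
URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"
pg <- read_html(URL)
# find the <script> block that defines the race.model data and take its source text
js <- html_nodes(pg, xpath=".//script[contains(., 'race.model')]") %>% html_text()
# evaluate that JavaScript in a fresh V8 context
ctx <- v8()
ctx$eval(JS(js))
# copy the JavaScript `race` object back into R as a nested list
race <- ctx$get("race", simplifyVector=FALSE)
str(race) ## output too large to paste here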
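To see why `read_html` misses the table and why the `V8` route works, here is a minimal offline sketch. The inline page, its `race` object, and the field names inside it are made up for illustration and are not the real structure of the FiveThirtyEight page; only the technique (parse the source, pull out the script text, evaluate it with `V8`) mirrors the code above.

```r
library(rvest)
library(V8)

# Hypothetical page: one table is in the HTML source, while the data for a
# second table lives only in a JavaScript variable that a browser would use.
page <- '<html><body>
<table class="static"><tr><td>in the HTML</td></tr></table>
<div id="target"></div>
<script>var race = {model: {state: "WA", forecasts: [{party: "D", win: 98.4}, {party: "R", win: 1.6}]}};</script>
</body></html>'

doc <- read_html(page)

# read_html() parses the source but never runs scripts, so only the
# table that is literally in the markup can be found.
tables <- html_nodes(doc, xpath = "//table")
length(tables)

# Extract the script text and evaluate it in a V8 context to recover the data.
js <- html_text(html_nodes(doc, xpath = ".//script[contains(., 'race')]"))
ctx <- v8()
ctx$eval(JS(js))
race <- ctx$get("race")
race$model$forecasts
```

With the default simplification in `ctx$get()`, an array of uniform objects such as `forecasts` comes back as a data frame; passing `simplifyVector=FALSE`, as in the answer's code, keeps everything as nested lists instead.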
If they ever change the format of the JavaScript (it's generated by an automated process, so that's unlikely, but you never know), the `RSelenium` approach will be the better choice, provided they don't also change the structure of the table (again unlikely, but you never know).
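For the `RSelenium` route, a rough sketch could look like the following. It assumes `RSelenium` is installed and that `rsDriver()` can start a local Firefox; the helper name `fetch_rendered_tables`, its arguments, and the fixed five-second wait are illustrative choices, not part of the original answer.

```r
library(rvest)

# Hypothetical helper: render a page in a real browser, then hand the
# rendered HTML to rvest. Requires RSelenium and a working local Firefox.
fetch_rendered_tables <- function(url, xpath = "//table", wait = 5) {
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  # Always shut down the browser session and the driver server on exit.
  on.exit({
    rD$client$close()
    rD$server$stop()
  }, add = TRUE)

  remDr <- rD$client
  remDr$navigate(url)
  Sys.sleep(wait)  # crude fixed wait so the page scripts can build the table

  # getPageSource() returns a list whose first element is the rendered HTML.
  src <- remDr$getPageSource()[[1]]
  html_nodes(read_html(src), xpath = xpath)
}
```

Calling `fetch_rendered_tables(URL, "//table[contains(@class, 't-calc')]")` and passing the result to `html_table()` should then give the table from the question as a data frame, as long as the page still renders it with that class.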