R rvest Can't Get html_node
Question
I have some experience using the rvest package to scrape the data I need from the web, but I've run into a problem with this page:
https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html
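If you scroll down a bit, you'll see the section where all the schools are listed.
I want the school, cases, and location data. I should note that someone on the NYT GitHub asked for this to be published as a CSV, and the reply was that the data is all in the page and can just be pulled from there. So I assume it is OK to scrape this page.
But I can't get it to work. Say I just want to start with a simple selector for the first school. I used the inspector to find the XPath.
The selector returns nothing: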
library(rvest)
URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
pg <- read_html(URL)
# xpath copied from inspector
xpath_first_school <- '//*[@id="school100663"]'
node_first_school <- html_node(pg, xpath = xpath_first_school)
node_first_school
#> {xml_missing}
#> <NA>
All I get back is {xml_missing}.
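Recommended answer
So I am going to provide an answer here that violates a very important rule described here and is generally an ugly solution. But it is a solution, and it lets us avoid having to use Selenium.
To use html_nodes on this page we would need the JavaScript to run, which requires Selenium. @KWN's solution seems to work on their machine, but I could not get chromedriver to work on mine. I got almost there using Docker with either Firefox or Chrome, but could not get the result. So I would check that solution first; if it fails, give this one a go. Essentially, the site has the data I need exposed as JSON. So I extract the page text, use a regular expression to isolate the JSON, and then parse it with jsonlite.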
library(jsonlite)
library(rvest)
library(tidyverse)
url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
html_res <- read_html(url)
# get text
text_res <- html_res %>%
  html_text(trim = TRUE)
# find the area of interest
data1 <- str_extract_all(text_res, "(?<=var NYTG_schools = ).*(?=;)")[[1]]
# get json into data frame
json_res <- fromJSON(data1)
# did it work?
glimpse(json_res)
Rows: 1,515
Columns: 16
$ ipeds_id <chr> "100663", "199120", "132903", "100751"...
$ nytname <chr> "University of Alabama at Birmingham",...
$ shortname <chr> "U.A.B.", "North Carolina", "Central F...
$ city <chr> "Birmingham", "Chapel Hill", "Orlando"...
$ state <chr> "Ala.", "N.C.", "Fla.", "Ala.", "Ala."...
$ county <chr> "Jefferson", "Orange", "Orange", "Tusc...
$ fips <chr> "01073", "37135", "12095", "01125", "0...
$ lat <dbl> 33.50199, 35.90491, 28.60258, 33.21402...
$ long <dbl> -86.80644, -79.04691, -81.20223, -87.5...
$ logo <chr> "https://static01.nyt.com/newsgraphics...
$ infected <int> 972, 835, 727, 568, 557, 509, 504, 500...
$ death <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
$ dateline <chr> "n", "n", "n", "n", "n", "n", "n", "n"...
$ ranking <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
$ medicalnote <chr> "y", NA, NA, NA, NA, NA, NA, NA, NA, N...
$ coord <list> [<847052.5, -406444.3>, <1508445.93, ...
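The same idea can be tried offline on a toy page, which makes it easier to see why the selector fails and why the regex works. This is only a minimal sketch under stated assumptions: the HTML string, the variable name DEMO_schools, the id school1, and the field names are all made up for illustration and are not taken from the NYT page.

```r
library(rvest)
library(jsonlite)
library(stringr)

# Hypothetical stand-in for a page whose table is built by JavaScript:
# the static HTML has an empty container, and the data sits in a <script>.
html_src <- '<html><body>
<div id="table"></div>
<script>
var DEMO_schools = [{"name":"School A","city":"Springfield","infected":12},{"name":"School B","city":"Shelbyville","infected":7}];
</script>
</body></html>'

pg <- read_html(html_src)

# read_html() does not run JavaScript, so a node that the script would
# create in a browser is simply not present in the parsed document.
missing_node <- html_node(pg, xpath = '//*[@id="school1"]')
class(missing_node)

# The script body is still part of the page text, so the JSON literal
# can be cut out with a lookbehind/lookahead pair, as in the answer.
json_txt <- str_extract(html_text(pg), "(?<=var DEMO_schools = ).*(?=;)")

# Parse the JSON array of objects into a data frame and keep a few columns.
schools <- fromJSON(json_txt)
schools[, c("name", "city", "infected")]
```

Because `.` does not match a newline by default, the pattern only works while the whole JSON literal sits on a single line, which appears to be the case for the page above. If the site changes how it embeds the data, the pattern will need adjusting.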