Web scraping with R, message that JavaScript is disabled

Problem description

Hello, I am trying to scrape a web page in R, and this one particular website is giving me a lot of trouble. I want to extract the table from this page: https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017

What I have tried

Code:

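library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'

webpage = read_html(url)

# Extract the text of every <p> element on the page
data = webpage %>% html_nodes('p') %>% html_text()
data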

Output:

[1] "\r\n            The page could not be loaded. This web site 
currently does not fully support browsers with \"JavaScript\" disabled. 
Please note that if you choose to continue without enabling 
\"JavaScript\" certain functionalities on this website may not be 
available.\r\n  

Recommended answer

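In this case, the page builds its content with JavaScript, and read_html() does not execute scripts, so it only sees the fallback message. You may want to use RSelenium with Docker, which drives a real browser, to scrape the rendered page.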
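Since the table only exists once the browser has run the page's scripts, a browser-driven scrape may need to wait before it reads the page source. Below is a minimal sketch of a generic polling helper in base R. The name wait_until and its parameters are hypothetical, not part of RSelenium or rvest, and the rendered() function is only a stand-in for a real check.

```r
# Hypothetical helper: call `condition` repeatedly until it returns TRUE
# or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in for "has the table been rendered yet?":
# it reports TRUE from the third check onwards.
checks <- 0
rendered <- function() {
  checks <<- checks + 1
  checks >= 3
}

ok <- wait_until(rendered, timeout = 5, interval = 0.1)
```

With the RSelenium session used in this answer, the condition could be something like function() grepl("gridAvergeScore", remDr$getPageSource()[[1]], fixed = TRUE), called after remDr$navigate(url). Whether this page needs such a wait at all is an assumption.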

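library(RSelenium)
library(rvest)
# Start a Selenium server with Firefox in Docker (host port 4445 -> container port 4444)
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
Sys.sleep(5)  # give the server a few seconds to start; the exact delay is a guess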

remDr <-  RSelenium::remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)

#Start the remote driver
remDr$open()


url = 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'

remDr$navigate(url)

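# Parse the HTML as rendered by the browser, after JavaScript has run
doc <- read_html(remDr$getPageSource()[[1]])
# Select the score table inside the element with this id and convert it to a data frame
table <- doc %>%
  html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
  html_table(fill = TRUE)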

head(table[[1]])

##    JURISDICTION AVERAGE SCORE (0 - 500)              AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1  JURISDICTION                   Score Difference from National public (NP)             At or above Basic        At or above Proficient
## 2 Massachusetts                     249                                   10                            87                            53
## 3     Minnesota                     249                                   10                            86                            53
## 4         DoDEA                     249                                    9                            91                            51
## 5      Virginia                     248                                    9                            87                            50
## 6    New Jersey                     248                                    9                            87                            50
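
In the output above, row 1 looks like a second header row rather than data, which would also leave every column as text. Here is a minimal sketch of one way to tidy this, assuming that reading of the table is right. The helper name promote_first_row is hypothetical, and raw is a small stand-in copied from the first rows of the output, not a fresh scrape.

```r
# Stand-in for table[[1]], built from the first rows of the printed output.
raw <- data.frame(
  c("JURISDICTION", "Massachusetts", "Minnesota"),
  c("Score", "249", "249"),
  c("Difference from National public (NP)", "10", "10"),
  c("At or above Basic", "87", "86"),
  c("At or above Proficient", "53", "53"),
  stringsAsFactors = FALSE
)
names(raw) <- c(
  "JURISDICTION",
  "AVERAGE SCORE (0 - 500)", "AVERAGE SCORE (0 - 500)",
  "ACHIEVEMENT LEVEL PERCENTAGES", "ACHIEVEMENT LEVEL PERCENTAGES"
)

# Hypothetical helper: use the first row as column names, drop that row,
# and convert columns that contain only numbers to a numeric type.
promote_first_row <- function(df) {
  header <- unname(unlist(df[1, ]))
  out <- df[-1, , drop = FALSE]
  names(out) <- as.character(header)
  rownames(out) <- NULL
  out[] <- lapply(out, function(col) utils::type.convert(col, as.is = TRUE))
  out
}

tidy <- promote_first_row(raw)
str(tidy)
```

With the scraped result, the same call would be promote_first_row(table[[1]]). Using the second header row as column names discards the grouping labels from the first one, which is a design choice rather than the only option.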
