Web scrape with rvest
Question
I'm trying to grab a table of data using read_html from the R package rvest.
I've tried the following code:
library(rvest)
raw <- read_html("https://demanda.ree.es/movil/peninsula/demanda/tablas/2016-01-02/2")
I don't believe the above pulled the data from the table, since I see that 'raw' is a list of 2:
'node:<externalptr>' and 'doc:<externalptr>'
I've also tried grabbing the nodes by XPath:
html_nodes(raw,xpath = '//*[(@id = "tabla_generacion")]//*[contains(concat( " ", @class, " " ), concat( " ", "ng-scope", " " ))]')
Any advice on what to try next?
Thanks.
Answer
This website uses Angular to make a call that fetches the data. You can use that same call to get the raw JSON. The response is not pure JSON, so you can't just run fromJSON(url); you have to download the data and strip the non-JSON wrapper before you parse it.
library(jsonlite)
library(httr)
url <- "https://demanda.ree.es/WSvisionaMovilesPeninsulaRest/resources/demandaGeneracionPeninsula?callback=angular.callbacks._2&curva=DEMANDA&fecha=2016-01-02"
a <- GET(url)
a <- content(a, as="text")
# get rid of the non-JSON stuff...
a <- gsub("^angular\\.callbacks\\._2\\(", "", a)
a <- gsub("\\);$", "", a)
df <- fromJSON(a, simplifyDataFrame = TRUE)
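The wrapper being stripped above is a JSONP callback: the server returns JavaScript of the form angular.callbacks._2({...}); rather than bare JSON. A minimal self-contained sketch of the unwrap-then-parse step, using a hypothetical payload string (the field names here are made up for illustration; the real response has its own structure):

```r
library(jsonlite)

# Hypothetical JSONP response: JSON wrapped in a callback invocation.
jsonp <- 'angular.callbacks._2({"valores":[{"ts":"2016-01-02 00:00","dem":21500}]});'

# Remove the leading "angular.callbacks._2(" and the trailing ");",
# leaving only the JSON object between them.
json <- gsub("^angular\\.callbacks\\._2\\(|\\);$", "", jsonp)

parsed <- fromJSON(json, simplifyDataFrame = TRUE)
parsed$valores$dem  # 21500
```

The same two-step idea (download as text, strip the callback, then fromJSON) works for any JSONP endpoint; only the callback name in the regex changes.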
I found this by pressing F12 in Chrome and looking at the "Sources" tab. The data that fills the table had to come from somewhere... so it was just a matter of figuring out where. I was unable to use rvest to scrape the table. I'm not sure the call that fetches the data gets executed when the page is loaded from R the way it is in Chrome... so there may have been no data for rvest to scrape.