Scraping a JavaScript object and converting to JSON within R/Rvest


Question

I am scraping the following website: https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio

I am trying to get the table of currency exchange rates into an R data frame via the rvest package, but the table itself is configured in a JavaScript variable within the HTML code.

I located the relevant CSS selector and now I have this:

library(rvest)    
banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
      read_html() %>%
      html_nodes('#indicadores_financieros_wrapper > script:nth-child(2)')

My output is now the following JavaScript, returned as an XML nodeset:

<script>
$(document).ready(function(){
    var valor = '{"tablaDivisas":[{"nombreDivisas":"FRANCO SUIZO","compra":"18.60","venta":"19.45"}, {"nombreDivisas":"LIBRA ESTERLINA","compra":"24.20","venta":"25.15"}, {"nombreDivisas":"YEN JAPONES","compra":"0.1635","venta":"0.171"}, {"nombreDivisas":"CORONA SUECA","compra":"2.15","venta":"2.45"}, {"nombreDivisas":"DOLAR CANADA","compra":"14.50","venta":"15.35"}, {"nombreDivisas":"EURO","compra":"21.75","venta":"22.60"}], "tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}';
    if(valor != '{}'){
        var objJSON = eval("(" + valor + ")");
        var tabla="<tbody>";
        for ( var i = 0; i < objJSON["tablaDolar"].length; i++) {
            tabla+= "<tr>";
            tabla+= "<td>" + objJSON["tablaDolar"][i].nombreDolar + "</td>";
            tabla+= "<td>$" + objJSON["tablaDolar"][i].compra + "</td>";
            tabla+= "<td>$" + objJSON["tablaDolar"][i].venta + "</td>";
            tabla+= "</tr>";
        }
        tabla+= "</tbody>";
        $("#tablaDolar").append(tabla);
        var tabla2="";
        for ( var i = 0; i < objJSON["tablaDivisas"].length; i++) {
            tabla2+= "<tr>";
            tabla2+= "<td>" + objJSON["tablaDivisas"][i].nombreDivisas + "</td>";
            tabla2+= "<td>$" + objJSON["tablaDivisas"][i].compra + "</td>";
            tabla2+= "<td>$" + objJSON["tablaDivisas"][i].venta + "</td>";
            tabla2+= "</tr>";
        }
        tabla2+= "</tbody>";
        $("#tablaDivisas").append(tabla2);
    }
    bmnIndicadoresResponsivoInstance.cloneResponsive(0);
});
</script>

My question is: how do I strip out almost everything (all the JavaScript functions and operators) to get only the data, so that I can eventually convert it to a JSON table like this:

{"tablaDivisas":[{"nombreDivisas":"FRANCO SUIZO","compra":"18.60","venta":"19.45"},
{"nombreDivisas":"LIBRA ESTERLINA","compra":"24.20","venta":"25.15"},
{"nombreDivisas":"YEN JAPONES","compra":"0.1635","venta":"0.171"}, 
{"nombreDivisas":"CORONA SUECA","compra":"2.15","venta":"2.45"}, 
{"nombreDivisas":"DOLAR CANADA","compra":"14.50","venta":"15.35"}, 
{"nombreDivisas":"EURO","compra":"21.75","venta":"22.60"}],
"tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}

In other words, I need to extract the "valor" variable from the JS script using R.

For some reason I've had trouble getting this done entirely within R (that is, without exporting the variable to an external .txt file and then using a substring).

Recommended answer

This is definitely a more heavyweight answer, but it generalizes to other, gnarlier "JavaScript problems".
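For contrast, a lighter-weight sketch that skips V8 entirely is to pull the quoted string out of the script text with a regular expression and hand it to jsonlite. This is not part of the original answer; it assumes the data always sits on a single line of the form var valor = '...'; and uses an abbreviated, hard-coded copy of the script (the `script` variable is a stand-in for html_text(banorte)). The raw-string syntax needs R 4.0 or later.

```r
library(stringi)
library(jsonlite)

# Abbreviated stand-in for the scraped <script> text (assumption: same shape
# as the script shown in the question).
script <- r"---($(document).ready(function(){
    var valor = '{"tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}';
    if(valor != '{}'){ /* table-building code omitted */ }
});)---"

# Capture everything between the single quotes of the `var valor` assignment.
json_txt <- stri_match_first_regex(script, "var valor = '(.*?)';")[, 2]

# Parse the captured JSON into a list of data frames.
tablas <- fromJSON(json_txt)
tablas$tablaDolar
```

This breaks as soon as the site changes how the variable is assigned, which is the trade-off against the V8-based approach that follows.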

library(rvest)
library(stringi)
library(V8)
library(tidyverse)

banorte <- "https://www.banorte.com/wps/portal/ixe/Home/indicadores/tipo-de-cambio/" %>%
      read_html() %>%
      html_nodes('#indicadores_financieros_wrapper > script:nth-child(2)')

We'll set up a JavaScript V8 context:

ctx <- v8()

Then:

  • get the <script> content
  • split it into lines
  • get it into a plain character vector
  • remove the cruft
  • evaluate the javascript

It's not too bad:

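html_text(banorte) %>%                              # get the <script> content
  stri_split_lines() %>%                            # split it into lines
  flatten_chr() %>%                                 # plain character vector
  keep(stri_detect_regex, "^\\s*var valor") %>%     # keep only the `var valor = ...` line (tabs or spaces)
  ctx$eval()                                        # evaluate that line in V8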

Since that JavaScript variable holds a JSON string, we parse it in R rather than in V8:

jsonlite::fromJSON(ctx$get("valor"))
## $tablaDivisas
##     nombreDivisas compra venta
## 1    FRANCO SUIZO  18.60 19.45
## 2 LIBRA ESTERLINA  24.20 25.15
## 3     YEN JAPONES 0.1635 0.171
## 4    CORONA SUECA   2.15  2.45
## 5    DOLAR CANADA  14.50 15.35
## 6            EURO  21.75 22.60
## 
## $tablaDolar
##   nombreDolar compra venta
## 1  VENTANILLA  17.73 19.15
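Note that compra and venta are quoted in the JSON, so they arrive in R as character columns. A minimal sketch of converting them to numeric follows; it uses an abbreviated, hard-coded copy of the JSON string rather than the live page, and to_numeric_rates is a hypothetical helper name, not something from the original answer.

```r
library(jsonlite)

# Abbreviated copy of the JSON string from the question (assumed unchanged).
valor <- '{"tablaDivisas":[{"nombreDivisas":"FRANCO SUIZO","compra":"18.60","venta":"19.45"},{"nombreDivisas":"EURO","compra":"21.75","venta":"22.60"}],"tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}'

tablas <- fromJSON(valor)

# Hypothetical helper: convert the quoted rate columns to numeric.
to_numeric_rates <- function(df) {
  df$compra <- as.numeric(df$compra)
  df$venta  <- as.numeric(df$venta)
  df
}

# Apply the conversion to both tables; the list names are preserved.
tablas <- lapply(tablas, to_numeric_rates)
str(tablas$tablaDivisas)
```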

Had there been other useful processing in the JavaScript, this approach would generalize better.
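To illustrate what "other processing in JavaScript" might look like, here is a sketch that does some extra work inside the V8 context before pulling the result back into R. The buy/sell spread calculation and the names dolar, nombre and spread are invented for illustration, and the var valor line is an abbreviated, hard-coded stand-in for the line kept from the scraped script (raw strings need R 4.0 or later).

```r
library(V8)

ctx <- v8()

# Stand-in for the `var valor = ...` line kept from the scraped script.
ctx$eval(r"---(var valor = '{"tablaDolar":[{"nombreDolar":"VENTANILLA","compra":"17.73","venta":"19.15"}]}';)---")

# Extra processing done in JavaScript: parse the string and compute a spread.
ctx$eval("
  var dolar = JSON.parse(valor).tablaDolar.map(function (d) {
    return {
      nombre: d.nombreDolar,
      spread: parseFloat(d.venta) - parseFloat(d.compra)
    };
  });
")

# ctx$get() converts the JavaScript array of objects to an R data frame.
dolar <- ctx$get("dolar")
dolar
```

Anything the page's own script computes can be reproduced this way, as long as it does not depend on the DOM or on jQuery, which a bare V8 context does not provide.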

NOTE: Google Translate in my Chrome beta channel was not translating the site well, but I think you're awfully close to violating the spirit of item 6 on the "Términos Legales" page. Until I can translate it, I can't fully tell. If I can, and it seems that you are, I'll delete this.
