How to parse a JavaScript data list with R


Question

I use R to parse HTML, and I would like to know the most efficient way to parse the following code:

<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>

I started with this:

infos = unlist(xpathApply(page,
                          '//script[@type="text/javascript"]',
                          xmlValue))
infos=gsub('\n|  ','',infos)
infos=gsub("var utag_data = ","",infos)
fromJSON(infos)

And the code above returns something really weird:

$nvironnemen
[1] "prod"

$evic
NULL

$isplaytyp
NULL

$agenam
[1] "adview" etc.

I would like to know how to do this efficiently: how can I directly parse the data list in the JavaScript? Thank you.

Recommended answer

I didn't try your code, but I think your gsub() regexes might be overaggressive (which is probably what is munging the names).
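
For the regex route, a narrower pattern may be safer than stripping text wholesale. The following is only a minimal base-R sketch, not part of the original answer: it assumes every value of interest is a double-quoted string and deliberately skips entries whose value is a function call. The names `js`, `pattern`, and `utag` are illustrative.

```r
# The script body from the question, held as a plain string for illustration.
js <- 'var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}'

# Match `name : "value"` pairs only; values that are function calls
# (no opening double quote after the colon) do not match and are skipped.
pattern <- '([[:alnum:]_]+)[[:space:]]*:[[:space:]]*"([^"]*)"'
pairs <- regmatches(js, gregexpr(pattern, js))[[1]]

# Split each matched pair into its key and its value.
keys <- sub(pattern, "\\1", pairs)
vals <- sub(pattern, "\\2", pairs)

# Assemble a named list, similar in shape to what a JSON parser would return.
utag <- setNames(as.list(vals), keys)
str(utag)
```

This only recovers the literal string entries (environnement, pagename, pagetype), so it is a fallback for simple cases rather than a real JavaScript parser.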

It's possible to run JavaScript code using the V8 package, but it won't be able to execute the DOM-based getDevice() and getDisplay() functions since they don't exist in the V8 engine:

library(V8)
library(rvest)

pg <- read_html('<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>')


script <- html_text(html_nodes(pg, xpath='//script[@type="text/javascript"]'))

ctx <- v8()

ctx$eval(script)
## Error: ReferenceError: getDevice is not defined

However, you can compensate for that:

# we need to remove the function calls and replace them with blanks
# since both begin with 'getD' this is pretty easy:
script <- gsub("getD[[:alpha:]\\(\\)\\$\\.]+,", "'',", script)  

ctx$eval(script)
ctx$get("utag_data")

## $environnement
## [1] "prod"
## 
## $device
## [1] ""
## 
## $displaytype
## [1] ""
## 
## $pagename
## [1] "adview"
## 
## $pagetype
## [1] "annonce"
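
A possible alternative to editing the script text is to define stand-ins for the missing functions inside the V8 context before evaluating the script. This is a hedged sketch rather than the answer's method: the stub bodies for `getDevice()`, `getDisplay()`, `$`, and `window` are assumptions made up for illustration, and they return empty placeholders rather than real device information.

```r
library(V8)

# The script body from the question, held as a plain string for illustration.
script <- 'var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}'

ctx <- v8()

# Hypothetical stubs: the real page defines these, the V8 engine does not.
ctx$eval('var window = {};')
ctx$eval('function $(x) { return { innerWidth: function() { return 0; } }; }')
ctx$eval('function getDevice() { return ""; }')
ctx$eval('function getDisplay(width) { return ""; }')

# With the stubs in place the original, unmodified script can be evaluated.
ctx$eval(script)
utag <- ctx$get("utag_data")
str(utag)
```

The script is left untouched here, which avoids a regex that depends on both function names starting with 'getD', at the cost of having to know which functions the page expects.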
