R: Generic flattening of JSON to data.frame

Question

This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.

There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.

In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].
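To make the naming convention concrete, here is a minimal base R sketch of such a utility. The function name flatten_paths and its prefix argument are hypothetical (they do not come from any package), and the sketch assumes unnamed elements should be addressed by position.

```r
# Hypothetical helper: walk a nested list and return a flat list whose names
# are the dot-joined paths to each leaf.
flatten_paths <- function(x, prefix = NULL) {
  if (!is.list(x)) {
    # Leaf: wrap it in a list so the value keeps its own type.
    return(stats::setNames(list(x), prefix))
  }
  keys <- names(x)
  if (is.null(keys)) keys <- rep("", length(x))
  # Unnamed elements are addressed by their position.
  unnamed <- is.na(keys) | keys == ""
  keys[unnamed] <- as.character(seq_along(x))[unnamed]
  pieces <- lapply(seq_along(x), function(i) {
    path <- if (is.null(prefix)) keys[i] else paste(prefix, keys[i], sep = ".")
    flatten_paths(x[[i]], path)
  })
  # Empty lists contribute no leaves.
  if (length(pieces) == 0) list() else do.call(c, pieces)
}

my_data <- list(x = list(1, list(1, 2, y = "e"), 3))
result <- flatten_paths(my_data)
result[["x.2.y"]]
```

Returning a list rather than an atomic vector means numeric and character leaves keep their own types.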

If one has a collection of these data structures that may not be entirely homogeneous, the key to flattening them successfully into a dataframe is to discover the names of all possible dataframe columns, e.g., by taking the union of the keys/names of the values in the individually flattened data structures.
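The union-of-names step might look like the following sketch. The records and their field names are made up for illustration, and the sketch assumes each record has already been flattened to a named list of scalar values.

```r
# Hypothetical input: records already flattened to named lists of scalars.
records <- list(
  list(id = 1, user.name = "ann", tags.1 = "a"),
  list(id = 2, user.name = "bob", user.age = 31),
  list(id = 3, tags.1 = "b", tags.2 = "c")
)

# 1. Discover every column by taking the union of names across all records.
all_cols <- unique(unlist(lapply(records, names)))

# 2. Build one column at a time; records that lack a field contribute NA.
columns <- lapply(all_cols, function(col) {
  values <- lapply(records, function(rec) {
    if (is.null(rec[[col]])) NA else rec[[col]]
  })
  unlist(values)
})
names(columns) <- all_cols

# 3. Assemble the data frame once, keeping the dotted names untouched.
df <- data.frame(columns, check.names = FALSE, stringsAsFactors = FALSE)
df
```

Building whole columns and calling data.frame() once is likely to be cheaper than growing the result row by row with rbind(), although that has not been measured here.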

This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

Recommended answer

Hi @Sim, I had cause to reflect on your problem yesterday. Define:

flatten <- function(x) {
    # Build the dot-separated path name of every leaf up front.
    dumnames <- unlist(getnames(x, TRUE))
    # Length-one leaves get a trailing ".1" from getnames; strip only that suffix.
    dumnames <- sub("\\.1$", "", dumnames)
    # Concatenate one nesting level at a time until no element is a list.
    while (any(vapply(x, is.list, logical(1)))) {
        x <- do.call(.Primitive("c"), x)
    }
    names(x) <- dumnames
    x
}

getnames <- function(x, recursive) {
    nametree <- function(x, parent_name, depth) {
        if (length(x) == 0)
            return(character(0))
        x_names <- names(x)
        if (is.null(x_names)) {
            # Unnamed elements are identified by their position.
            x_names <- seq_along(x)
        } else {
            # Fill in positions only where a name is missing.
            x_names[x_names == ""] <- seq_along(x)[x_names == ""]
        }
        x_names <- paste(parent_name, x_names, sep = "")
        if (!is.list(x) || (!recursive && depth >= 1L))
            return(x_names)
        x_names <- paste(x_names, ".", sep = "")
        lapply(seq_along(x), function(i) nametree(x[[i]], x_names[i], depth + 1L))
    }
    nametree(x, "", 0L)
}

(getnames is adapted from AnnotationDbi:::make.name.tree)

(flatten is adapted from the discussion in "How to flatten a list to a list without coercion?")

As a simple example:

my_data<-list(x=list(1,list(1,2,y='e'),3))

> my_data[['x']][[2]][['y']]
[1] "e"

> out<-flatten(my_data)
> out
$x.1
[1] 1

$x.2.1
[1] 1

$x.2.2
[1] 2

$x.2.y
[1] "e"

$x.3
[1] 3

> out[['x.2.y']]
[1] "e"

So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.
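To see why avoiding coercion matters, the sketch below contrasts base R's unlist() with level-by-level concatenation on the same example. It uses base R only; the variable names are illustrative.

```r
nested <- list(x = list(1, list(1, 2, y = "e"), 3))

# unlist() collapses everything into one atomic vector, so the numbers
# are converted to strings alongside "e".
coerced <- unlist(nested)
class(coerced)

# Concatenating one level at a time keeps each leaf as its own list element,
# so every value retains its original type.
flat <- nested
while (any(vapply(flat, is.list, logical(1)))) {
  flat <- do.call(c, flat)
}
leaf_classes <- unname(vapply(flat, class, character(1)))
leaf_classes
```

The names that c() generates along the way are not path-shaped, which is why flatten computes the names separately with getnames.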

A more complicated example:

library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)

Update

A naive way to remove the trailing .1:

my_data<-list(x=list(1,list(1,2,y='e'),3))

> sub("\\.1$","",unlist(getnames(my_data,TRUE)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"
