Read Json file into a data.frame without nested lists

Question

I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function in the jsonlite package, but I am getting nested lists and am not sure how to flatten the input into a two-dimensional data.frame. jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.

Does anyone have any tips on loading a JSON file into a data.frame when it reads in with nested lists?

#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")

# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 

# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )

# but many of the columns are still lists:
sapply( providers , class)

# Some of these lists have a single level
head( providers$facility_type )

# Some have lot more than two - for example nine
providers[ , 6][[1]]

I want one row per npi, and then separate columns for each of the slices of the individual lists, so that the data frame has columns for "plan_id_type", "plan_id", and "network_tier" nine times, perhaps with column-name suffixes from 0 to 8. I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file in two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 . I would like to load a file with this structure as a data.frame using the fromJSON function.

Here are a few of the things I have tried. I have thought of two approaches. First: use a different function to read in the JSON file. I have looked at rjson, but that reads in a list:
library( rjson )
library( RCurl )  # provides getURL(), which rjson and httr do not export
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )

I have also tried RJSONIO, following the approach in "Getting imported json data into a data frame in R":
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )

json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

# but when converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)

So, the second approach I have tried is to convert all the lists into variables in my data.frame.

detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )

library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 
providers <- flatten( providers )

I am able to pull out one of the lists, but because of missing values I can't merge it back onto my data frame:

a <- data.frame(Reduce(rbind,  providers$facility_type))
length( a ) == nrow( providers )

I also tried these suggestions: "Converting nested list to dataframe", as well as some other things, but haven't had any luck:

a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )

Any help is much appreciated.

Recommended answer

Update: 21 February 2016

col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.

In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
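To see which of these column kinds a given data.frame contains, a small helper along the following lines may be useful. This is only a sketch: `classify_columns` and the `demo` data set are hypothetical names introduced here for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the answer): label each column by its shape.
classify_columns <- function(df) {
  vapply(df, function(col) {
    if (is.data.frame(col)) {
      # A data.frame stored as a single column (same number of rows as df).
      "nested data.frame"
    } else if (is.list(col)) {
      # Look only at the non-NULL elements to decide what the list holds.
      non_null <- Filter(Negate(is.null), col)
      if (length(non_null) > 0 &&
          all(vapply(non_null, is.data.frame, logical(1)))) {
        "list of data.frames"
      } else {
        "list of vectors"
      }
    } else {
      "atomic"
    }
  }, character(1))
}

# Small stand-in data set with one column of each kind (made-up values).
demo <- data.frame(id = 1:2)
demo$tags <- list(c("x", "y"), NULL)
demo$addr <- list(data.frame(v1 = 1), data.frame(v1 = 1:2))
demo$person <- data.frame(name = c("AA", "BB"), age = c(20, 32))

kinds <- classify_columns(demo)
print(kinds)
```

The check for `is.data.frame` comes first because a data.frame is itself a list, so `is.list` alone cannot tell the two apart.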

Here's a sample dataset that recreates those conditions:

mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                   facility = I(list(c("x", "y"), NULL, "x")),
  address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                   data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                   data.frame(v1 = 1, v2 = NA, v3 = 3))))

mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))

The str of this sample data.frame looks like:

str(mydf)
## 'data.frame':    3 obs. of  5 variables:
##  $ id      : int  1 2 3
##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
##  $ facility:List of 3
##   ..$ : chr  "x" "y"
##   ..$ : NULL
##   ..$ : chr "x"
##   ..- attr(*, "class")= chr "AsIs"
##  $ address :List of 3
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: num 2
##   .. ..$ v4: num 3
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ v1: int  1 2
##   .. ..$ v2: int  3 4
##   .. ..$ v3: num  5 5
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: logi NA
##   .. ..$ v3: num 3
##   ..- attr(*, "class")= chr "AsIs"
##  $ person  :'data.frame':    3 obs. of  3 variables:
##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##   ..$ age       : num  20 32 23
##   ..$ preference: logi  TRUE FALSE TRUE
## NULL

One way you can "flatten" this is to "fix" the list columns. There are three fixes.

  1. flatten (from "jsonlite") will take care of columns like the "person" column.
  2. Columns like the "facility" column can be fixed either with toString, which converts each element to a single comma-separated string, or by splitting each element into multiple columns.
  3. Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together.)
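The second and third fixes can be tried in isolation before they are wrapped in a function. The sketch below uses base R only, so it is an approximation of the idea rather than the data.table code used in this answer; in particular, the wide column names it produces (v11, v12, ...) differ from the v1_1, v1_2 style names that dcast generates. The variable names are illustrative.

```r
# Fix 2: a list column whose elements are plain vectors (or NULL).
facility <- list(c("x", "y"), NULL, "x")

# Option (a): collapse each element into one comma-separated string.
# toString(NULL) gives an empty string.
as_string <- vapply(facility, toString, character(1))

# Option (b): pad every element with NA up to the longest length,
# giving one column per position.
width <- max(lengths(facility))
as_columns <- t(vapply(
  facility,
  function(v) c(as.character(v), rep(NA_character_, width - length(v))),
  character(width)
))

# Fix 3: a multi-row data.frame element squeezed into a single wide row.
# unlist() numbers the repeated values of each column (v11, v12, ...).
address <- data.frame(v1 = 1:2, v2 = 3:4, v3 = 5)
wide <- as.data.frame(t(unlist(address)))

print(as_string)
print(as_columns)
print(wide)
```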

We can take care of the second and third points with a function like the following:

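col_fixer <- function(x, vec2col = FALSE) {
  # The first element decides the branch: anything that is not a list
  # (a plain vector or NULL) is treated as a column of vectors.
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      # One column per position; shorter vectors are padded with NA.
      as.data.table(data.table::transpose(x))
    } else {
      # One comma-separated string per row.
      vapply(x, toString, character(1L))
    }
  } else {
    # Stack the data.frames; .id records which element each row came from.
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    # Number the rows within each element.
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    # Go wide: one row per element, one column per variable/row-number pair.
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}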

We'll integrate that and the flatten function in another function that would do most of the processing.

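Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  # Spread nested data.frame columns (like "person") into plain columns.
  indf <- flatten(indf)
  # Fix every remaining list column and bind the results side by side.
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  # Drop the original list columns and append their flattened replacements.
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}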

Running the function gives us:

Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA

Or, with the vectors going into separate columns:

Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA

Here's the str:

str(Flattener(mydf))
## 'data.frame':    3 obs. of  14 variables:
##  $ id               : int  1 2 3
##  $ type             : Factor w/ 2 levels "A","B": 1 1 2
##  $ person.name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##  $ person.age       : num  20 32 23
##  $ person.preference: logi  TRUE FALSE TRUE
##  $ facility         : chr  "x, y" "" "x"
##  $ address.v1_1     : num  1 1 1
##  $ address.v1_2     : num  NA 2 NA
##  $ address.v2_1     : num  2 3 NA
##  $ address.v2_2     : num  NA 4 NA
##  $ address.v4_1     : num  3 NA NA
##  $ address.v4_2     : num  NA NA NA
##  $ address.v3_1     : num  NA 5 3
##  $ address.v3_2     : num  NA 5 NA
## NULL

On your "providers" object, this runs very quickly and consistently:

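library(microbenchmark)
# Note: flattenList() and jsonRList are not defined in this answer; they must
# already exist in the session for the third expression to run.
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))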
out
# Unit: milliseconds
#                        expr        min         lq      mean    median        uq       max neval
#        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
#  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
#      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100

library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker
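
To experiment without downloading the providers file, a small inline JSON string can stand in for it. The sketch below assumes a structure loosely modelled on the fields mentioned in the question (npi, name, facility_type, and plan entries with plan_id and network_tier); the `plans` field name and all values are made up for illustration. It shows that after jsonlite's flatten, the nested "name" object becomes ordinary columns while the array fields remain list columns, which is the part that col_fixer is meant to handle.

```r
library(jsonlite)

# Toy JSON with made-up values; the real file has more fields.
json_txt <- '[
  {"npi": "1", "name": {"first": "Ann", "last": "Lee"},
   "facility_type": ["Clinic", "Hospital"],
   "plans": [{"plan_id": "P1", "network_tier": "PREFERRED"},
             {"plan_id": "P2", "network_tier": "PREFERRED"}]},
  {"npi": "2", "name": {"first": "Bob", "last": "Ray"},
   "facility_type": ["Clinic"],
   "plans": [{"plan_id": "P1", "network_tier": "PREFERRED"}]}
]'

toy <- fromJSON(json_txt, simplifyDataFrame = TRUE)

# flatten() spreads the nested "name" data.frame into name.first / name.last.
toy_flat <- jsonlite::flatten(toy)

# Columns that are still lists after flatten() and need further fixing.
still_list <- names(toy_flat)[vapply(toy_flat, is.list, logical(1))]

print(sapply(toy_flat, class))
print(still_list)
```

Calling `jsonlite::flatten` with its namespace avoids clashes with other packages that also export a function named flatten.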
