Extracting to a data frame from a JSON-generated multi-level list with occasional missing elements


Problem description

I'm pulling soccer data through an API - the resulting JSON is returned as a list; dput example below:

list(list(id = 10332894L, league_id = 8L, season_id = 12962L, 
aggregate_id = NULL, venue_id = 201L, localteam_id = 51L, 
visitorteam_id = 27L, weather_report = list(code = "drizzle", 
    temperature = list(temp = 53.92, unit = "fahrenheit"), 
    clouds = "90%", humidity = "87%", wind = list(speed = "12.75 m/s", 
        degree = 200L)), attendance = 25098L, leg = "1/1", 
deleted = FALSE, referee = list(data = list(id = 15267L, 
    common_name = "L. Probert", fullname = "Lee Probert", 
    firstname = "Lee", lastname = "Probert"))), list(id = 10332895L, 
league_id = 8L, season_id = 12962L, aggregate_id = NULL, 
venue_id = 340L, localteam_id = 251L, visitorteam_id = 78L, 
weather_report = list(code = "drizzle", temperature = list(
    temp = 50.07, unit = "fahrenheit"), clouds = "90%", humidity = "93%", 
    wind = list(speed = "6.93 m/s", degree = 160L)), attendance = 22973L, 
leg = "1/1", deleted = FALSE, referee = list(data = list(
    id = 15273L, common_name = "M. Oliver", fullname = "Michael Oliver", 
    firstname = "Michael", lastname = "Oliver"))))

I'm extracting using a for loop at the moment - the reprex shows 2 top level list items when there are hundreds in the full data. The main drawback of using a loop is that there are sometimes missing values which cause the loop to stop. I'd like to move this to purrr but am struggling to extract 2nd level nested items using at_depth or modify_depth. There are also nests inside nests which really adds to the complexity.

The end-state should be a tidy data frame - from this data the df will only have 2 rows but will have many columns each representing an item, no matter where that item is nested in this list. If something's missing then it should be an NA value.

The ideal solution, even if it is inelegant, would produce one data frame per level / nested item, which can then be bound together later.

Thanks.

Recommended answer

Step 1: Replace NULL with NA using the community wiki's function:

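# Recursively apply fn to every non-list element, keeping the list structure.
# NULL is not a list, so NULL elements are passed to fn as well.
simple_rapply <- function(x, fn) {
  if (is.list(x)) {
    lapply(x, simple_rapply, fn)
  } else {
    fn(x)
  }
}

# l is assumed to hold the list from the question's dput output.
non.null.l <- simple_rapply(l, function(x) if (is.null(x)) NA else x)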
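
To see why Step 1 matters, here is a minimal sketch on a made-up record (the `toy` list is hypothetical, not part of the API response). It shows that `unlist` silently drops NULL elements, so the column would vanish for that match, whereas after the replacement the element survives as NA:

```r
# Same helper as above, repeated so this snippet runs on its own.
simple_rapply <- function(x, fn) {
  if (is.list(x)) lapply(x, simple_rapply, fn) else fn(x)
}

# Hypothetical toy record with NULLs at two different depths.
toy <- list(
  id = 1L,
  aggregate_id = NULL,
  referee = list(data = list(id = 7L, fullname = NULL))
)

# Without the replacement, the NULL entries disappear when flattening.
before <- names(unlist(toy))
print(before)

# With the replacement, every element keeps its place as NA.
cleaned <- simple_rapply(toy, function(x) if (is.null(x)) NA else x)
after <- names(unlist(cleaned))
print(after)
```

The first call prints only `id` and `referee.data.id`; the second also includes `aggregate_id` and `referee.data.fullname`.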

Step 2: Flatten each top-level item with unlist and bind the results into one data frame:

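library(purrr)
library(dplyr)  # bind_rows() comes from dplyr, not purrr
# unlist() flattens each match into a named vector; bind_rows() makes each one a row.
map_df(map(non.null.l, unlist), bind_rows)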
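
One thing to be aware of: `unlist` coerces mixed values to a single type, so the columns typically come back as character. The following is a base R sketch of the same flatten-and-bind idea, not the answer's purrr code, with a type-restoring step added. The `records` list is hypothetical, and the second record deliberately lacks a referee and has a NULL attendance. Here the name lookup fills in NA for anything absent, so the NULL replacement is not needed in this variant.

```r
# Hypothetical records; the second has no referee and a NULL attendance.
records <- list(
  list(id = 1L, attendance = 25098L,
       weather_report = list(code = "drizzle", temperature = list(temp = 53.92)),
       referee = list(data = list(fullname = "Lee Probert"))),
  list(id = 2L, attendance = NULL,
       weather_report = list(code = "clear", temperature = list(temp = 50.07)))
)

# One named vector per record; nested names are joined with dots.
rows <- lapply(records, unlist)

# Union of all column names seen in any record.
cols <- unique(unlist(lapply(rows, names)))

# Look up every column in every row; names a row lacks come back as NA.
mat <- t(vapply(rows, function(r) as.character(r[cols]), character(length(cols))))

df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- cols

# Convert the character columns back to integer / double where possible.
df <- type.convert(df, as.is = TRUE)
str(df)
```

The same `type.convert(df, as.is = TRUE)` call can be applied to the result of Step 2 if typed columns are needed.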
