Extracting from Nested list to data frame


Problem description

I will put the dput of my list at the bottom so that the question is reproducible. The dput is of a, not x.

I have a big nested list called x from which I'm trying to build a data frame, but I cannot figure out how.

I have done the first part:

a <- list()
for (i in 1:3) {
  a[[i]] <- x$results[[i]]$experiences
}
indx <- lengths(a)
zz <- as.data.frame(do.call(rbind, lapply(a, `length<-`, max(indx))))

For this I used the following answer: Converting nested list (unequal length) to data frame
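The padding trick from that answer can be sketched on a toy ragged list. The values in `ragged` are made up for illustration and only stand in for the real `a`:

```r
# Toy ragged list standing in for `a` (hypothetical values).
ragged <- list(character(0), c("2014", "2012", "2006"), "2015")

# The longest element decides the number of columns.
n_max <- max(lengths(ragged))

# `length<-` extends each vector to n_max, filling the tail with NA.
padded <- lapply(ragged, `length<-`, n_max)

# Stack the equal-length vectors into the rows of a data frame.
zz <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(zz)
```

This only works cleanly when each element is an atomic vector; when the elements are themselves nested lists, each cell ends up holding a whole list.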

This leaves me a data.frame with n columns for results where n is the max results for any i:

  v1   v2   v3
1 NULL NULL NULL
2  *    *    *
3 NULL NULL NULL

Each * is another nested list in the format list(experience = list(duration = ...

Take, for example, the first * in row 2, column v1. I don't want the whole list; I only want:

a[[2]][[1]]$experience$start

or in terms of the original list x:

x$results[[2]]$experiences[[1]]$experience$start

I feel like I'm nearly there with some tweaks. I tried:

for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
indx <- lengths(a)
for(y in 1:length(a[[i]])) aa <- rbind(aa,tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(NA)))
zz <- as.data.frame(do.call(rbind,lapply(aa, `length<-`, max(indx))))}

Resulting in:

  v1     v2     v3
1  NA     NA     NA
2  NA     NA     NA
3 2014    NA     NA
4 2012    NA     NA
5 2006    NA     NA
6  NA     NA     NA
7  NA     NA     NA 

I tried cbind instead of rbind on the final line, and that put all the dates in the first row.

I also tried the following:

for(i in 1:3){a[[i]]<-lengths(x$results[[i]]$experiences)
  indx <- lengths(a)
for(y in 1:length(indx)){tt[i] <- tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(""))}
zz <- as.data.frame(do.call(rbind,lapply(tt, `length<-`, max(indx))))}

This came close: it builds the right format but only returns the first result:

  v1   v2  v3
1 NA   NA  NA
2 2014 NA  NA
3 NA   NA  NA

The format I want is:

 V1  V2  V3
1 NA  NA  NA
2 2014 2012 2006
3 NA  NA  NA

((Sample data now at bottom))

Newest attempt:

The following returns only the first start date from each a[[i]]; in the second loop I need to make the assignment to aa[i][y] something different.

 for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
 for(y in 1:length(a[[i]])){aa[i][y] = if(is.null(a[[i]][[y]]$experience$start)){"NULL"}else{a[[i]][[y]]$experience$start}}}

So for dput2 I'd like the form:

  v1    v2  v3   v4   v5   v6   v7   v8
1 2015
2 2011 2007 null null null null null null
3 2016 2015 2015 2015 2013 2010

I don't mind whether the blanks are NULL or NA.

UPDATE

The answer below almost works. However, in my data the structure changes: the order of the names (roleName, duration, etc.) varies, which breaks the answer because cumsum is used to determine when a new list starts. If duration comes before start, the keys are 9 and 1, and the cumsum step labels them as two different lists.

I wrote the following:

my.list2 <- list(structure(
  list(
    experience = structure(
      list(
        start = "1",
        end = "1",
        roleName = "a",
        summary = "a",
        duration = "a",
        current = "a",
        org = structure(list(name = "a", url = "a"), .Names = c("name","url")),
        location = structure(
          list(
            displayLocation = NULL,
            lat = NULL,
            lng = NULL
          ),
          .Names = c("displayLocation",
                     "lat", "lng")
        ) ),.Names = c("start", "end", "roleName", "summary", "duration", "current", "org", "location")),
    `_meta` = structure(
      list(weight = 1L, `_sources` = list(structure(
        list(`_origin` = "a"), .Names = "_origin"
      ))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))

Then:

aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]), 
                     function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})

This changes the structure such that key2 will always be in the right order.
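The reordering step can be illustrated on a toy record. Here `template`, `rec` and `sparse` are hypothetical names and values, not part of the real data; the sketch only shows how indexing a list by a fixed name vector forces a consistent field order:

```r
# Hypothetical field order, playing the role of names(my.list2[[1]]$experience).
template <- c("start", "end", "roleName", "duration")

# A record whose fields arrive in a different order.
rec <- list(duration = "2", end = "3000", start = "2012", roleName = "a")

# Indexing by the template returns the fields in template order.
reordered <- rec[template]
print(names(reordered))

# A record with a single field: missing names come back as NULL entries
# with NA names, so the names are restored with setNames().
sparse <- list(roleName = "a")
picked <- setNames(sparse[template], template)
print(lengths(picked))
```

Note that the NULL entries disappear again once the result is passed through unlist(), which is why a record holding only roleName still contributes a single key.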

However, after this loop I found another issue.

Sometimes the experience list contains nothing but a roleName. If that happens twice in a row, the keys are repeated, and cumsum treats the two records as the same experience instead of separate ones.

This means I cannot create df3 because of duplicate identifiers for rows. Even if I could, by removing the troublesome rows, the names wouldn't match: i in the solution below builds the names from the sequence of list lengths, and removing any rows changes those lengths.
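The grouping problem can be reproduced with a few made-up keys. The vector `key` below is hypothetical: one record with three fields (keys 1, 2, 3) followed by two records that hold only roleName (key 3 each). The `<=` variant at the end is just one possible tweak, and it assumes the fields of every record are already in template order and no field repeats within a record:

```r
# Hypothetical keys: one full record, then two roleName-only records.
key <- c(1, 2, 3, 3, 3)

# Previous key for each position (0 before the first element).
prev <- c(0, key[-length(key)])

# Original rule: a new group starts only when the key drops.
key2 <- cumsum(key < prev) + 1
print(key2)

# Possible tweak: also start a new group when the key repeats.
key2_fixed <- cumsum(key <= prev) + 1
print(key2_fixed)
```

With the original rule all five values land in group 1; with `<=` the two roleName-only records get their own groups.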

Here is my full code for more insight:

for(i in 1:x$count){a[[i]]<-x$results[[i]]$experiences}

  aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]), 
                     function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})

  aaa <- unlist(aa)
  dummydf <- data.frame(b=c("start", "end", "roleName", "summary", 
                            "duration", "current", "org.name",  "org.url"), key=1:8)

  df <- data.frame(a=aaa, b=names(aaa))
  df2 <- left_join(df, dummydf)
  df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)

  df_split <- split(df2, df2$key2)
  df3 <- lapply(df_split, function(x){
    x %>% select(-c(key, key2)) %>% spread(b, a)
  }) %>% data.table::rbindlist(fill=TRUE) %>% t
  df3 <- data.frame(df3)
  i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
  names(df3) <- paste0(names(df3), "_", i)
  df4 <- data.frame(t(df3))
  df4$dates <- as.Date(NA)
  df4$dates <- as.Date(df4$start)
  df4 <- data.frame(dates = df4$dates)
  df4 <- t(df4)
  df4 <- data.frame(df4)
  names(df4) <- paste0(names(df4), "_", i)
  df4[] <- lapply(df4[], as.character)
  l1 <- lapply(split(stack(df4), sub('.*_', '', stack(df4)[,2])), '[', 1)
  df5 <- t(do.call(cbindPad, l1))
  df5 <- data.frame(df5)

cbindPad is taken from this question

New sample code including the issues:

dput3 = 
list(list(), list(
structure(list(experience = structure(list(
  duration = "1", start = "2014", 
  end = "3000", roleName = "a", 
  summary = "aaa", 
  org = structure(list(name = "a"), .Names = "name"), 
  location = structure(list(displayLocation = NULL, lat = NULL, 
    lng = NULL), .Names = c("displayLocation", "lat", "lng"
    ))), .Names = c("duration", "start", "end", "roleName", "summary", 
    "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
      structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
      "_sources"))), .Names = c("experience", "_meta")), 
structure(list(
        experience = structure(list(end = "3000", 
        start = "2012", duration = "2", 
        roleName = "a", summary = "aaa", 
        org = structure(list(name = "None"), .Names = "name"), 
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("end", "start", "duration", "roleName", 
        "summary", "org", "location")), `_meta` = structure(list(
          weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")), 
  structure(list(
            experience = structure(list(duration = "3", 
            start = "2006", end = "3000", 
            roleName = "a", summary = "aaa", org = structure(list(name = " "), .Names = "name"), 
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName",
            "summary", "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
            "_sources"))), .Names = c("experience", "_meta")),
  structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
           "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")),
structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
            "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta"))
            ), 
            list(
structure(list(experience = structure(list(
              duration = "1", start = "2014", 
              end = "3000", roleName = "a", 
              summary = "aaa", 
              org = structure(list(name = "a"), .Names = "name"), 
              location = structure(list(displayLocation = NULL, lat = NULL, 
                lng = NULL), .Names = c("displayLocation", "lat", "lng"
                ))), .Names = c("duration", "start", "end", "roleName", "summary", 
                "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
                  structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
                  "_sources"))), .Names = c("experience", "_meta"))))

Solution

Maybe this can help

library(dplyr)
library(tidyr)

a <- unlist(a)

df <- data.frame(a=a, b=names(a)) %>% mutate(key=cumsum(b=="experience.duration")) %>% 
      split(.$key) %>% lapply(function(x) x %>% select(-key) %>% spread(b, a)) %>% 
      do.call(rbind, .) %>% t %>% data.frame

df$key <- rownames(df)

Then you can filter on the rows of interest.

The above would be equivalent to

rbind(unlist(a)[1:8], unlist(a)[9:16],unlist(a)[17:24]) %>% t
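The marker-based grouping can be seen on a tiny flattened vector. `flat` is made-up data, and the sketch assumes every record starts with the `experience.duration` field, which is exactly the assumption the cumsum key relies on:

```r
# Hypothetical result of unlist(a): two records of two fields each.
flat <- c(experience.duration = "1", experience.start = "2014",
          experience.duration = "2", experience.start = "2012")

# The counter increases each time the marker field appears.
key <- cumsum(names(flat) == "experience.duration")

# Split into one named vector per record, then bind them into rows.
groups <- split(flat, key)
wide <- do.call(rbind, groups)
print(wide)
```

If a record does not start with the marker field, or lacks it altogether, its values are attached to the previous record.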

Update

Try this for dput2:

a <- unlist(dput2)

library(dplyr)
library(tidyr)

dummydf <- data.frame(b=c("experience.start", "experience.end", "experience.roleName", "experience.summary", 
                      "experience.org", "experience.org.name",  "experience.org.url", 
                      "_meta.weight", "_meta._sources._origin", "experience.duration"), key=1:10)


df <- data.frame(a=a, b=names(a))

df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x){
       x %>% select(-c(key, key2)) %>% spread(b, a)
       }) %>% data.table::rbindlist(fill=TRUE) %>% t

df3 <- data.frame(df3)
i <- sapply(seq_along(dput2), function(y) rep(y, sapply(dput2, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)

View(df3)
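If only the start values are needed, a different route (not the method above) is to skip unlist() and read the field from each record directly. This is a minimal sketch on made-up data shaped like the dput3 sample; `get_start` is a hypothetical helper, and records without a start are assumed to map to NA:

```r
# Toy stand-in for `a`: a list of results, each a list of experience records.
a <- list(
  list(),
  list(
    list(experience = list(duration = "1", start = "2014")),
    list(experience = list(end = "3000", start = "2012")),
    list(experience = list(start = "2006")),
    list(experience = list(roleName = "a")),
    list(experience = list(roleName = "a"))
  ),
  list(list(experience = list(start = "2014")))
)

# Pull `start` from one record, using NA when the field is absent.
get_start <- function(rec) {
  s <- rec$experience$start
  if (is.null(s)) NA_character_ else as.character(s)
}

# One character vector per result; empty results give character(0).
starts <- lapply(a, function(res) vapply(res, get_start, character(1)))

# Pad every vector to the longest length and bind into rows.
n_max <- max(lengths(starts))
wide <- as.data.frame(do.call(rbind, lapply(starts, `length<-`, n_max)),
                      stringsAsFactors = FALSE)
print(wide)
```

Because each record is handled on its own, the order of the fields and records holding only a roleName do not affect the grouping.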
