Extracting from a nested list to a data frame
Problem description
I will put the dput of what my list looks like at the bottom so that the question is reproducible. The dput is of a, not x.
I have a big nested list called x that I'm trying to build a data frame from, but I cannot figure it out. I have done the first part:
for (i in 1:3) {
  a[[i]] <- x$results[[i]]$experiences
  indx <- lengths(a)
  zz <- as.data.frame(do.call(rbind, lapply(a, `length<-`, max(indx))))
}
For this I used the following answer: Converting nested list (unequal length) to data frame
This leaves me with a data.frame with n columns of results, where n is the maximum number of results for any i:
  v1   v2   v3
1 NULL NULL NULL
2 *    *    *
3 NULL NULL NULL
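The padding trick borrowed from the linked answer can be seen in isolation on a toy list. This is only a sketch: the list `a` below is hypothetical stand-in data, not the real `x$results`.

```r
# Hypothetical toy input: three list elements of unequal length.
a <- list(list(), list("2014", "2012", "2006"), list("2015"))

n_max  <- max(lengths(a))                 # the longest element sets the column count
padded <- lapply(a, `length<-`, n_max)    # shorter elements are padded with NULL
m      <- do.call(rbind, padded)          # list-matrix: one row per element of `a`

dim(m)      # rows = elements of `a`, columns = n_max
m[[2, 1]]   # first entry of the second element
```

Because the cells are list entries, anything stored in an element (including a whole nested list) is carried into the matrix unchanged, which is why each `*` above is still a nested list.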
Each * is another nested list in the format list(experience = list(duration = ... For example, the first * is in row 2, column v1. I don't want the whole list. I only want:
a[[2]][[1]]$experience$start
or, in terms of the original list x:
x$results[[2]]$experiences[[1]]$experience$start
I feel like I'm nearly there with some tweaks. I tried:
for (i in 1:3) {
  a[[i]] <- x$results[[i]]$experiences
  indx <- lengths(a)
  for (y in 1:length(a[[i]]))
    aa <- rbind(aa, tryCatch(x$results[[i]]$experiences[[y]]$experience$start,
                             error = function(e) print(NA)))
  zz <- as.data.frame(do.call(rbind, lapply(aa, `length<-`, max(indx))))
}
Resulting in:
  v1   v2 v3
1 NA   NA NA
2 NA   NA NA
3 2014 NA NA
4 2012 NA NA
5 2006 NA NA
6 NA   NA NA
7 NA   NA NA
I tried cbind instead of rbind on the final line, and that put all the dates in the first row.
I also tried the following:
for (i in 1:3) {
  a[[i]] <- lengths(x$results[[i]]$experiences)
  indx <- lengths(a)
  for (y in 1:length(indx)) {
    tt[i] <- tryCatch(x$results[[i]]$experiences[[y]]$experience$start,
                      error = function(e) print(""))
  }
  zz <- as.data.frame(do.call(rbind, lapply(tt, `length<-`, max(indx))))
}
This came close: it builds the right format but only returns the first result:
  v1   v2 v3
1 NA   NA NA
2 2014 NA NA
3 NA   NA NA
The format I want is:
  V1   V2   V3
1 NA   NA   NA
2 2014 2012 2006
3 NA   NA   NA
(Sample data is now at the bottom.)
Newest attempt:
The following returns only the first start date from each a[[i]]; in the second loop I need to make the list aa[i][y] something different.
for (i in 1:3) {
  a[[i]] <- x$results[[i]]$experiences
  for (y in 1:length(a[[i]])) {
    aa[i][y] = if (is.null(a[[i]][[y]]$experience$start)) {"NULL"} else {a[[i]][[y]]$experience$start}
  }
}
So for dput2 I'd like the form:
  v1   v2   v3   v4   v5   v6   v7   v8
1 2015
2 2011 2007 null null null null null null
3 2016 2015 2015 2015 2013 2010
I don't mind if the blanks are NULL or NA.
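One way to reach the layout described above is to pull out each `start` first and pad afterwards. The following is a minimal base R sketch under stated assumptions: `a` is hypothetical toy data shaped like the question's list, and `get_start` is an illustrative helper, not part of the original code.

```r
# Hypothetical stand-in for `a`: one element per result,
# each holding zero or more experiences.
a <- list(
  list(),
  list(list(experience = list(duration = "1", start = "2014")),
       list(experience = list(start = "2012", duration = "2")),
       list(experience = list(roleName = "a"))),      # no start at all
  list(list(experience = list(start = "2015")))
)

# Return the start value of one experience, or NA when it is missing.
# `[[` is used instead of `$` to avoid partial name matching.
get_start <- function(e) {
  s <- e[["experience"]][["start"]]
  if (is.null(s)) NA_character_ else as.character(s)
}

# One character vector of starts per result (length 0 for empty results).
starts <- lapply(a, function(res) vapply(res, get_start, character(1)))

# Pad every vector with NA up to the longest length, then bind the rows.
n_max <- max(lengths(starts), 1L)
wide <- as.data.frame(do.call(rbind, lapply(starts, `length<-`, n_max)),
                      stringsAsFactors = FALSE)
names(wide) <- paste0("V", seq_len(n_max))
wide
```

Extracting before padding means the inner loop never has to assign into `aa[i][y]`, and a missing `start` becomes NA instead of raising an error.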
UPDATE
The answer below almost works. However, in my data the structure changes: the order of the names (roleName, duration, etc.) changes, which breaks the answer because cumsum is used to determine when a new list is found. If you have duration then start, the keys are 9 and 1, and the cumsum part labels them as two different lists. I wrote the following:
my.list <- list(structure(
  list(
    experience = structure(
      list(
        start = "1", end = "1", roleName = "a", summary = "a",
        duration = "a", current = "a",
        org = structure(list(name = "a", url = "a"), .Names = c("name", "url")),
        location = structure(
          list(displayLocation = NULL, lat = NULL, lng = NULL),
          .Names = c("displayLocation", "lat", "lng")
        )
      ),
      .Names = c("start", "end", "roleName", "summary", "duration", "current", "org", "location")
    ),
    `_meta` = structure(
      list(weight = 1L, `_sources` = list(structure(list(`_origin` = "a"), .Names = "_origin"))),
      .Names = c("weight", "_sources")
    )
  ),
  .Names = c("experience", "_meta")
))
Then:
aa <- lapply(1:length(a), function(y) {
  tryCatch(
    lapply(1:length(a[[y]]), function(i) {
      a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]
    }),
    error = function(e) print(list())
  )
})
This changes the structure such that key2 will always be in the right order. However, after this loop I found another issue.
Sometimes the experience list contains, for example, nothing but a roleName. If that occurs twice in a row, the keys are repeated, and cumsum treats them as the same experience instead of separate ones. This means I cannot create df3 because of duplicate identifiers for rows. And even if I could by removing the troublesome rows, the names wouldn't match, because i in the solution below matches the names using the sequence, and removing any rows changes the lengths. Here is my total code for more insight:
for (i in 1:x$count) {
  a[[i]] <- x$results[[i]]$experiences
}
aa <- lapply(1:length(a), function(y) {
  tryCatch(
    lapply(1:length(a[[y]]), function(i) {
      a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]
    }),
    error = function(e) print(list())
  )
})
aaa <- unlist(aa)
dummydf <- data.frame(b = c("start", "end", "roleName", "summary",
                            "duration", "current", "org.name", "org.url"), key = 1:8)
df <- data.frame(a = aaa, b = names(aaa))
df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) + 1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x) {
  x %>% select(-c(key, key2)) %>% spread(b, a)
}) %>% data.table::rbindlist(fill = TRUE) %>% t
df3 <- data.frame(df3)
i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)
df4 <- data.frame(t(df3))
df4$dates <- as.Date(NA)
df4$dates <- as.Date(df4$start)
df4 <- data.frame(dates = df4$dates)
df4 <- t(df4)
df4 <- data.frame(df4)
names(df4) <- paste0(names(df4), "_", i)
df4[] <- lapply(df4[], as.character)
l1 <- lapply(split(stack(df4), sub('.*_', '', stack(df4)[,2])), '[', 1)
df5 <- t(do.call(cbindPad, l1))
df5 <- data.frame(df5)
cbindPad is taken from this question.
New sample code including the issues:
dput3 = list(
  list(),
  list(
    structure(list(
      experience = structure(list(
        duration = "1", start = "2014", end = "3000", roleName = "a", summary = "aaa",
        org = structure(list(name = "a"), .Names = "name"),
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("duration", "start", "end", "roleName", "summary", "org", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta")),
    structure(list(
      experience = structure(list(
        end = "3000", start = "2012", duration = "2", roleName = "a", summary = "aaa",
        org = structure(list(name = "None"), .Names = "name"),
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("duration", "start", "end", "roleName", "summary", "org", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta")),
    structure(list(
      experience = structure(list(
        duration = "3", start = "2006", end = "3000", roleName = "a", summary = "aaa",
        org = structure(list(name = " "), .Names = "name"),
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("duration", "start", "end", "roleName", "summary", "org", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta")),
    structure(list(
      experience = structure(list(
        roleName = "a",
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("roleName", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta")),
    structure(list(
      experience = structure(list(
        roleName = "a",
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("roleName", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta"))
  ),
  list(
    structure(list(
      experience = structure(list(
        duration = "1", start = "2014", end = "3000", roleName = "a", summary = "aaa",
        org = structure(list(name = "a"), .Names = "name"),
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL),
                             .Names = c("displayLocation", "lat", "lng"))),
        .Names = c("duration", "start", "end", "roleName", "summary", "org", "location")),
      `_meta` = structure(list(
        weight = 1L,
        `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))),
        .Names = c("weight", "_sources"))),
      .Names = c("experience", "_meta"))))
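The two failure modes described in the update (fields arriving out of order, and consecutive experiences holding a single field) can be reproduced with toy key sequences. This is only a sketch: `new_group` is a hypothetical helper wrapping the same `cumsum` expression used for key2, and the key numbers are illustrative ranks in a fixed reference order.

```r
# Same rule as the key2 line: a new record starts whenever the key drops
# below the previous key.
new_group <- function(key) cumsum(key < c(0, key[-length(key)])) + 1

# Fields arrive in reference order: the drop from 3 to 1 marks a new record.
in_order  <- new_group(c(1, 2, 3, 1, 2, 3))

# One record whose fields are out of order (9 then 1) is split in two.
reordered <- new_group(c(9, 1, 2))

# Two consecutive records that each hold only roleName (key 3) are merged,
# because the key never drops.
single    <- new_group(c(3, 3))

in_order
reordered
single
```

Both problems come from inferring record boundaries from the flattened names, so any fix along these lines has to carry the record index explicitly rather than reconstruct it.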
Solution
Maybe this can help:
library(dplyr)
library(tidyr)
a <- unlist(a)
df <- data.frame(a = a, b = names(a)) %>%
  mutate(key = cumsum(b == "experience.duration")) %>%
  split(.$key) %>%
  lapply(function(x) x %>% select(-key) %>% spread(b, a)) %>%
  do.call(rbind, .) %>%
  t %>%
  data.frame
df$key <- rownames(df)
Then you can filter on the rows of interest.
The above would be equivalent to
rbind(unlist(a)[1:8], unlist(a)[9:16],unlist(a)[17:24]) %>% t
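The grouping step in the pipeline above can be checked on a small hypothetical list in base R. The sketch assumes, as the answer does, that every experience begins with a duration field.

```r
# Hypothetical toy data: two experiences, each starting with `duration`.
a <- list(
  list(experience = list(duration = "1", start = "2014")),
  list(experience = list(duration = "2", start = "2012"))
)

flat <- unlist(a)   # names become "experience.duration", "experience.start", ...
key  <- cumsum(names(flat) == "experience.duration")   # counter bumps at each duration
recs <- split(unname(flat), key)                        # one chunk per experience

names(flat)
key
recs
```

Each time `experience.duration` appears the counter increases by one, so all fields up to the next duration share a key. That is also why the approach depends on the field order staying fixed.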
Update
Try this for dput2:
library(dplyr)
library(tidyr)
a <- unlist(dput2)
dummydf <- data.frame(b = c("experience.start", "experience.end", "experience.roleName",
                            "experience.summary", "experience.org", "experience.org.name",
                            "experience.org.url", "_meta.weight", "_meta._sources._origin",
                            "experience.duration"), key = 1:10)
df <- data.frame(a = a, b = names(a))
df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) + 1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x) {
  x %>% select(-c(key, key2)) %>% spread(b, a)
}) %>% data.table::rbindlist(fill = TRUE) %>% t
df3 <- data.frame(df3)
i <- sapply(seq_along(dput2), function(y) rep(y, sapply(dput2, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)
View(df3)
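If the field order varies or some experiences hold only one field, an alternative that does not rely on key order is to look each field up by name and record the result and experience indices explicitly. The following is a base R sketch, not the answer's method: `results`, `fields`, and `one_row` are hypothetical names, and the fields are assumed to be scalar.

```r
# Hypothetical toy data: varying field order, plus two consecutive
# experiences that hold only roleName.
results <- list(
  list(),
  list(list(experience = list(duration = "1", start = "2014", roleName = "a")),
       list(experience = list(start = "2012", duration = "2")),
       list(experience = list(roleName = "a")),
       list(experience = list(roleName = "a")))
)

fields <- c("start", "end", "roleName", "duration")   # assumed fields of interest

# Build one data-frame row for experience j of result i, looking up each
# field by exact name so order and missing entries do not matter.
one_row <- function(e, i, j) {
  vals <- vapply(fields, function(f) {
    v <- e[["experience"]][[f]]
    if (is.null(v)) NA_character_ else as.character(v)
  }, character(1))
  data.frame(result = i, experience = j, t(vals), stringsAsFactors = FALSE)
}

rows <- list()
for (i in seq_along(results)) {
  for (j in seq_along(results[[i]])) {
    rows[[length(rows) + 1]] <- one_row(results[[i]][[j]], i, j)
  }
}
long <- do.call(rbind, rows)   # one row per experience, NA where a field is absent
long
```

Since every row carries its own `result` and `experience` index, the two roleName-only experiences stay separate, and results with no experiences simply contribute no rows.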