展平具有复杂嵌套结构的列表 [英] Flatten a list with complex nested structure
问题描述
我有一个具有以下示例结构的列表:
I have a list with the following example structure:
> dput(test)
structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(
var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2",
"var3")), section2 = structure(list(row = structure(list(var1 = 1,
var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")),
row = structure(list(var1 = 4, var2 = 5, var3 = 6), .Names = c("var1",
"var2", "var3")), row = structure(list(var1 = 7, var2 = 8,
var3 = 9), .Names = c("var1", "var2", "var3"))), .Names = c("row",
"row", "row"))), .Names = c("id", "var1", "var3", "section1",
"section2"))
> str(test)
List of 5
$ id : num 1
$ var1 : num 2
$ var3 : num 4
$ section1:List of 3
..$ var1: num 1
..$ var2: num 2
..$ var3: num 3
$ section2:List of 3
..$ row:List of 3
.. ..$ var1: num 1
.. ..$ var2: num 2
.. ..$ var3: num 3
..$ row:List of 3
.. ..$ var1: num 4
.. ..$ var2: num 5
.. ..$ var3: num 6
..$ row:List of 3
.. ..$ var1: num 7
.. ..$ var2: num 8
.. ..$ var3: num 9
请注意, section2
列表包含名为 rows
的元素.这些代表多个记录.我所拥有的是一个嵌套列表,其中某些元素处于根级别,而其他元素是同一观察的多个嵌套记录.我想要 data.frame
格式的以下输出:
Notice that the section2
list contains elements named rows
. These represent multiple records. What I have is a nested list where some elements are at the root level and others are multiple nested records for the same observation. I would like the following output in a data.frame
format:
> desired
id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
1 1 2 4 1 2 3 1 4 7
2 NA NA NA NA NA NA 2 5 8
3 NA NA NA NA NA NA 3 6 9
根级元素应填充第一行,而 row
元素应具有自己的行.更为复杂的是, row
条目中的变量数量可能会有所不同.
Root-level elements should populate the first row, while row
elements should have their own rows. As an added complication, the number of variables in the row
entries can vary.
推荐答案
这是一种通用方法.它不假设您只有三行;它将与您拥有的任何行一起工作.而且,如果嵌套结构中缺少值(例如,第2节中的某些子列表不存在var1),则代码会正确返回该单元格的NA.
Here's a general approach. It doesn't assume that you'll have only three row; it will work with however many rows you have. And if a value is missing in the nested structure (e.g. var1 doesn't exist for some sub-lists in section2), the code correctly returns an NA for that cell.
例如如果我们使用以下数据:
E.g. if we use the following data:
test <- structure(list(id = 1, var1 = 2, var3 = 4, section1 = structure(list(var1 = 1, var2 = 2, var3 = 3), .Names = c("var1", "var2", "var3")), section2 = structure(list(row = structure(list(var1 = 1, var2 = 2), .Names = c("var1", "var2")), row = structure(list(var1 = 4, var2 = 5), .Names = c("var1", "var2")), row = structure(list( var2 = 8, var3 = 9), .Names = c("var2", "var3"))), .Names = c("row", "row", "row"))), .Names = c("id", "var1", "var3", "section1", "section2"))
通常的方法是使用melt创建一个包含有关嵌套结构信息的数据框,然后进行广播以将其成型为所需的格式.
The general approach is to use melt to create a dataframe that includes information about the nested structure, and then dcast to mold it into the format you desire.
library("reshape2")
flat <- unlist(test, recursive=FALSE)
names(flat)[grep("row", names(flat))] <- gsub("row", "var", paste0(names(flat)[grep("row", names(flat))], seq_len(length(names(flat)[grep("row", names(flat))])))) ## keeps track of rows by adding an ID
ul <- melt(unlist(flat))
split <- strsplit(rownames(ul), split=".", fixed=TRUE) ## splits the names into component parts
max <- max(unlist(lapply(split, FUN=length)))
pad <- function(a) {
c(a, rep(NA, max-length(a)))
}
levels <- matrix(unlist(lapply(split, FUN=pad)), ncol=max, byrow=TRUE)
## Get the nesting structure
nested <- data.frame(levels, ul)
nested$X3[is.na(nested$X3)] <- levels(as.factor(nested$X3))[[1]]
desired <- dcast(nested, X3~X1 + X2)
names(desired) <- gsub("_", "\\.", gsub("_NA", "", names(desired)))
desired <- desired[,names(flat)]
> desired
## id var1 var3 section1.var1 section1.var2 section1.var3 section2.var1 section2.var2 section2.var3
## 1 1 2 4 1 2 3 1 4 7
## 2 NA NA NA NA NA NA 2 5 8
## 3 NA NA NA NA NA NA 3 6 9
这篇关于展平具有复杂嵌套结构的列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!