Extracting data from hierarchical lists of different lengths into `data.frame` using `purrr`
Question
This is a direct follow-up to a previous, similar question I asked about extracting a specific subset of a list of lists: Extracting data from a list of lists into its own `data.frame` with `purrr`.
Hence I will use the same sample dataset:
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", 
"externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))
The general issue I am trying to resolve is how to extract the contents of a nested list whose elements have varying lengths, and bind them to other contents of the same list that essentially serve as IDs for the nested contents.
In the context of the above sample dataset, I am trying to extract the contents of the sublist `d` into a `data.table`/`data.frame`, but also extract the data in `a` and repeat it for each extracted element, so that I can tell which elements of `d` belong to the same subset despite their differing lengths. An example of the desired `data.table` explains it best:
         a  id  label           link                  score       externalId
-1.5467647   5  Utah            Asia/Anadyr           -0.2110459  NA
-0.9348211   8  South Carolina  Pacific/Wallis         0.5265409  -6.743544
-0.9348211   9  Nebraska        America/Scoresbysund   0.2508955  16.42575
Note that the first column, `a`, comes from the first element of each sublist in `l`. The first row is the content of `d` in the first item (length 1); the second and third rows are the content of `d` in the second item (length 2), hence the value in `a` is the same, `-0.9348211`, for both.
At present my solutions are roundabout and error-prone. Given how closely this relates to the post referenced above, I wonder whether I have not understood that solution well enough to extend it to this problem.
Recommended answer
Each nested list tends to require a slightly different approach, but this covers some typical ones:
library(tidyverse)
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", "label", "link", "score", 
"externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))
l %>%
  map(set_names, letters[1:5]) %>%      # add missing names
  map(modify_at, 'd', bind_rows) %>%    # coerce nested elements to data.frame
  # turn each element into a data.frame, and rbind them all together
  map_df(data.frame, stringsAsFactors = FALSE)
#>             a b c d.id        d.label               d.link    d.score         e d.externalId
#> 1  -1.5467647 s T    5           Utah          Asia/Anadyr -0.2110459  49.12799           NA
#> 2  -0.9348211 k T    8 South Carolina       Pacific/Wallis  0.5265409  52.31614    -6.743544
#> 3  -0.9348211 k T    9       Nebraska America/Scoresbysund  0.2508955  52.31614    16.425747
#> 4  -0.2726149 f P    8        Georgia         America/Nome  0.5264941 132.11535     7.915836
#> 5  -0.2726149 f P    2     Washington     America/Shiprock -0.5551864 132.11535    15.068666
#> 6  -0.2726149 f P    6   North Dakota            Universal  1.0316830 132.11535           NA
#> 7  -0.2726149 f P    1  New Hampshire      America/Cordoba  1.2158206 132.11535     9.727642
#> 8  -0.2726149 f P    1         Alaska        Asia/Istanbul -0.2318326 132.11535           NA
#> 9  -0.2726149 f P    4   Pennsylvania Africa/Dar_es_Salaam  0.5902453 132.11535           NA
#> 10  0.2026860 x O    3       Delaware       Asia/Samarkand  0.6955771  97.99089    15.236482
#> 11 -0.3962434 z P    4   North Dakota      America/Tortola  1.0306027 123.59795    -7.216669
#> 12 -0.3962434 z P    9       Nebraska      America/Ojinaga -1.1139800 123.59795    -8.451451
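The first step of the pipeline is there because the third element of `l` was built with `.Names = c("a", "e")`, so most of its elements cannot be looked up by name. The following is a minimal sketch of that situation; the list `x` and its values are made up for illustration and are not part of the original data.

```r
library(purrr)

# Hypothetical three-element list carrying only two names, mimicking the
# third element of `l`, which has five elements but `.Names = c("a", "e")`.
x <- structure(list(1, "s", 3), .Names = c("a", "e"))

# Looking up an element by a name it does not carry returns NULL.
missing_before <- x[["c"]]

# Supplying a full set of names makes every element addressable by name.
x_named <- set_names(x, letters[1:3])
found_after <- x_named[["c"]]
```

Without this step, `modify_at(x, 'd', ...)` would find no element called `d` to modify in the third item.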
There are many more ways to do this, but the key is to start by arranging the most deeply nested elements into the proper data structure, and then to combine them with the remaining elements until you have a data.frame.
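As one illustration of that idea, here is a hedged sketch that returns only `a` plus the columns of `d`, which is the shape the question asks for. It runs on a small made-up list, `l_small`, with the same structure as `l`; the helper `flatten_item` is a hypothetical name, and the sketch assumes every item already has complete names.

```r
library(purrr)
library(dplyr)

# Made-up stand-in for `l`: `d` is a single record in the first item
# and a list of two records in the second.
l_small <- list(
  list(a = -1.5, b = "s", d = list(id = 5L, label = "Utah")),
  list(a = -0.9, b = "k", d = list(
    list(id = 8L, label = "South Carolina", externalId = -6.7),
    list(id = 9L, label = "Nebraska", externalId = 16.4)
  ))
)

# Hypothetical helper: build the innermost structure first, then attach
# the identifying value from the enclosing item.
flatten_item <- function(x) {
  d_rows <- bind_rows(x$d)   # one row per record in `d`
  mutate(d_rows, a = x$a)    # the scalar `a` is repeated for every row
}

result <- l_small %>%
  map(flatten_item) %>%
  bind_rows() %>%            # records lacking `externalId` get NA
  select(a, everything())    # put the ID column first
```

For the full dataset, the `map(set_names, letters[1:5])` step from the answer would still need to run first, because one item is missing most of its names.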
Note that using `data.frame` instead of a tibble equivalent is a little hacky here, but `data.frame` is much better at slurping up both data.frames and single values into one data.frame, recycling as necessary. A tidyverse version would require making everything the correct length instead of relying on recycling.
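The recycling behaviour this relies on can be seen in isolation with a small base R sketch. The values and the names `nested` and `combined` are made up for illustration.

```r
# A two-row data.frame standing in for one item's `d` after bind_rows().
nested <- data.frame(
  id = c(8L, 9L),
  label = c("South Carolina", "Nebraska"),
  stringsAsFactors = FALSE
)

# data.frame() repeats the scalar `a` to match the two rows of `nested`
# and prefixes the nested column names with "d.".
combined <- data.frame(a = -0.93, d = nested, stringsAsFactors = FALSE)
```

This is why the output above has columns such as `d.id` and `d.label`, and why `a`, `b`, `c` and `e` are repeated on every row that came from the same item.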