Extracting data from hierarchical lists of different lengths into `data.frame` using `purrr`


Problem description

This is a direct follow-up to a similar question I asked earlier about extracting a specific subset of a list of lists: Extracting data from a list of lists into its own `data.frame` with `purrr`

Hence I will use the same sample dataset:

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
                     d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
                                        score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = 
c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))

The general issue I am trying to resolve is to extract contents of a nested list which are of varying lengths, and bind them to other contents within the same list which are essentially being used as IDs for the nested contents.

In the context of the above sample dataset, I am trying to extract the contents of the sublist `d` into a data.table/data.frame, and also to repeat the value of `a` for each extracted row, so that I can tell which rows from `d` belong to the same parent element despite their differing lengths. An example of the desired data.table explains it best:

a          id           label                        link       score  externalId
-1.5467647  5            Utah                 Asia/Anadyr  -0.2110459          NA
-0.9348211  8  South Carolina              Pacific/Wallis   0.5265409   -6.743544
-0.9348211  9        Nebraska        America/Scoresbysund   0.2508955    16.42575

Note that the first column, `a`, holds the value of `a` from each sublist of `l`. The first row comes from `d` in the first sublist (length 1); the second and third rows come from `d` in the second sublist (length 2), so both carry the same `a` value, -0.9348211.

At present my solutions are roundabout and error-prone. Given how closely this relates to the post referenced above, I wonder whether I have misunderstood that solution, since I cannot extend it to this related problem.

Recommended answer

Each nested list tends to require a slightly different approach, but this covers some typical ones:

library(tidyverse)

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
                     d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
                                        score = -0.21104594634643), .Names = c("id", "label", "link", "score")), e = 49.1279871269422), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.934821052832427, b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Scoresbysund", score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", "label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", c = "P", d = list(structure(list(id = 8L, label = "Georgia", link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 2L, label = "Washington", link = "America/Shiprock", score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 6L, label = "North Dakota", link = "Universal", score = 1.03168296038975), .Names = c("id", "label", "link", "score")), structure(list(id = 1L, label = "New Hampshire", link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", "label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", "label", "link", "score"))), e = 132.1153538536), .Names = c("a", "e")), structure(list(a = 0.202685974077313, b = "x", c = "O", d = structure(list(id = 3L, label = "Delaware", link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = 
c("id", "label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", "b", "c", "d", "e")), structure(list(a = -0.396243444741009, b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", "label", "link", "score", "externalId")), structure(list(id = 9L, label = "Nebraska", link = "America/Ojinaga", score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", "label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", "b", "c", "d", "e")))

l %>% 
    map(set_names, letters[1:5]) %>%    # add missing names (the third element only names "a" and "e")
    map(modify_at, 'd', bind_rows) %>%    # coerce each nested `d` to a data.frame
    # turn each element into a data.frame (recycling a, b, c and e), then row-bind them all
    map_df(data.frame, stringsAsFactors = FALSE)
#>             a b c d.id        d.label               d.link    d.score         e d.externalId
#> 1  -1.5467647 s T    5           Utah          Asia/Anadyr -0.2110459  49.12799           NA
#> 2  -0.9348211 k T    8 South Carolina       Pacific/Wallis  0.5265409  52.31614    -6.743544
#> 3  -0.9348211 k T    9       Nebraska America/Scoresbysund  0.2508955  52.31614    16.425747
#> 4  -0.2726149 f P    8        Georgia         America/Nome  0.5264941 132.11535     7.915836
#> 5  -0.2726149 f P    2     Washington     America/Shiprock -0.5551864 132.11535    15.068666
#> 6  -0.2726149 f P    6   North Dakota            Universal  1.0316830 132.11535           NA
#> 7  -0.2726149 f P    1  New Hampshire      America/Cordoba  1.2158206 132.11535     9.727642
#> 8  -0.2726149 f P    1         Alaska        Asia/Istanbul -0.2318326 132.11535           NA
#> 9  -0.2726149 f P    4   Pennsylvania Africa/Dar_es_Salaam  0.5902453 132.11535           NA
#> 10  0.2026860 x O    3       Delaware       Asia/Samarkand  0.6955771  97.99089    15.236482
#> 11 -0.3962434 z P    4   North Dakota      America/Tortola  1.0306027 123.59795    -7.216669
#> 12 -0.3962434 z P    9       Nebraska      America/Ojinaga -1.1139800 123.59795    -8.451451

There are many more ways to do this, but the key is to start by arranging the most nested elements into the proper data structure, and then combining them with the remaining elements until you have a data.frame.
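To make the inside-out idea concrete, here is a minimal sketch of the two stages for a single element. The object `x` and its values are a made-up stand-in with the same shape as one entry of `l`, not the real data, and the intermediate names `inner` and `outer` are only for illustration.

```r
library(dplyr)

# Hypothetical single element shaped like one entry of `l`:
# a scalar `a` plus a nested list of records in `d`.
x <- list(
  a = -0.93,
  d = list(
    list(id = 8L, label = "South Carolina"),
    list(id = 9L, label = "Nebraska")
  )
)

# Stage 1: arrange the most nested records into one data.frame.
inner <- bind_rows(lapply(x$d, as.data.frame, stringsAsFactors = FALSE))

# Stage 2: combine with the remaining scalar. data.frame() recycles `a`
# to the number of rows in `inner` and prefixes the nested columns with "d.".
outer <- data.frame(a = x$a, d = inner, stringsAsFactors = FALSE)

names(outer)
nrow(outer)
```

Once every element looks like `outer`, row-binding them is the easy part.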

Note that using data.frame instead of a tibble equivalent is a little hacky here, but data.frame is much better at slurping up both data.frames and values into a single data.frame, recycling as necessary. Using a tidyverse version would require making everything the correct length instead of relying on recycling.
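For comparison, a tidyverse-only version might look something like the sketch below, where `a` is repeated to the right length explicitly instead of relying on recycling. It runs on a small made-up list, `l_small`, that mimics the shape of `l`; the helper `as_records()` is hypothetical and assumes that a single record is a named list while multiple records arrive as an unnamed list.

```r
library(purrr)
library(dplyr)
library(tibble)

# Small stand-in for `l` (made-up values, same shape as the question's data):
# `d` is either a single record or an unnamed list of records.
l_small <- list(
  list(a = 1.5, d = list(id = 5L, label = "Utah")),
  list(a = 2.5, d = list(
    list(id = 8L, label = "South Carolina", externalId = -6.7),
    list(id = 9L, label = "Nebraska", externalId = 16.4)
  ))
)

# Hypothetical helper: wrap a single record so `d` is always a list of records.
as_records <- function(d) {
  if (!is.null(names(d))) list(d) else d
}

per_element <- map(l_small, function(x) {
  # One tibble row per record; missing fields become NA when row-bound.
  rows <- bind_rows(map(as_records(x$d), as_tibble))
  # Repeat `a` to the exact number of rows rather than relying on recycling.
  bind_cols(tibble(a = rep(x$a, nrow(rows))), rows)
})

res <- bind_rows(per_element)
res
```

This yields only `a` plus the fields of `d`, which is the layout asked for in the question; other top-level fields such as `b` or `e` could be repeated the same way.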
