Extracting data from a list of lists into its own `data.frame` with `purrr`


Problem description

Representative sample data (list of lists):

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
    score = -0.21104594634643), .Names = c("id", "label", 
"link", "score")), e = 49.1279871269422), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.934821052832427, 
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", 
    link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Scoresbysund", 
    score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", 
c = "P", d = list(structure(list(id = 8L, label = "Georgia", 
    link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 2L, label = "Washington", link = "America/Shiprock", 
    score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 6L, label = "North Dakota", link = "Universal", 
    score = 1.03168296038975), .Names = c("id", "label", 
"link", "score")), structure(list(id = 1L, label = "New Hampshire", 
    link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", 
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", 
    link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", 
"label", "link", "score"))), e = 132.1153538536), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x", 
c = "O", d = structure(list(id = 3L, label = "Delaware", 
    link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", 
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.396243444741009, 
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", 
    link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Ojinaga", 
    score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", 
"b", "c", "d", "e")))

I have a list of lists that comes from a JSON data download.

The list has 176 elements, each with 33 nested elements, some of which are themselves lists of varying length.

I am interested in analyzing the data in one particular nested list. For each of the 176 elements it has a length of ~150, and each of its entries has either 4 or 5 elements. I am trying to extract this nested list and convert it into a data.frame so that I can perform some analysis.

In the representative sample data above, I am interested in the nested list d in each of the 5 elements of l. The desired data.frame would therefore look something like:

id           label            link       score  externalId
 5            Utah     Asia/Anadyr  -0.2110459          NA
 8  South Carolina  Pacific/Wallis   0.5265409   -6.743544
 .
 .

I've been attempting to use purrr, which appears to have a sensible and consistent flow for processing data in lists, but I am running into errors whose cause I can't fully understand. It could very well be that I don't properly understand the commands or logic of purrr or of lists (likely both). This is the code I've been attempting, along with the error it throws:

df <- map_df(l, "d", ~as.data.frame(.))
Error: incompatible sizes (5 != 4)

I believe this has to do with the differing lengths of d across components, or with the differing contents (sometimes 4 elements, sometimes 5), or perhaps the function I've used here is misspecified. Truthfully, I'm not entirely sure.
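One way to check the first hypothesis is to look at the shape of d in each element. The sketch below uses base R only and a hypothetical two-element miniature of l (the name `l_small` and its values are illustrative, not taken from the real download). It suggests that d comes in two shapes: sometimes a single record, sometimes a list of records.

```r
# Hypothetical miniature of `l`: names and values are illustrative only.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# TRUE when `d` is a list of records, FALSE when `d` is itself one record.
d_is_record_list <- vapply(l_small, function(x) is.list(x$d[[1]]), logical(1))

# Top-level length of each `d`: the number of fields for a single record,
# the number of records otherwise.
d_lengths <- lengths(lapply(l_small, `[[`, "d"))

print(d_is_record_list)
print(d_lengths)
```

Because the top-level length of d means different things in the two shapes, binding the d elements directly is likely to mix field counts with record counts, which would be consistent with a size-mismatch error.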

I have worked around this with a for loop, which I know is inefficient, hence my question here on SO.

This is the for loop I currently use:

df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric())
for(i in seq_along(l)){
    df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.))
    df <- rbind(df, df_temp)
}

Some assistance, preferably with purrr, would be greatly appreciated; some version of apply would also do, as that is still superior to my for loop. Also, if there's a resource that covers the above, I'd like to understand it rather than just find the right code.

Recommended answer

You can do this in three steps: first pull out d, then bind the rows within each element of d, and finally bind everything into a single object.

I use bind_rows from dplyr for the within-list row binding. map_df does the final row binding.
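The within-list step is the one that absorbs the two shapes of d. As a minimal sketch, assuming a reasonably recent dplyr and using hypothetical records (`single_record` and `record_list` are illustrative names, not part of the original data), bind_rows turns either shape into a data frame and fills missing columns with NA:

```r
library(dplyr)

# Hypothetical records, shaped like the two kinds of `d` in the sample data.
single_record <- list(id = 5L, label = "Utah", score = -0.21)
record_list <- list(
  list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
  list(id = 6L, label = "North Dakota", score = 1.03)
)

one_row  <- bind_rows(single_record)  # flat named list -> one row
two_rows <- bind_rows(record_list)    # list of records -> one row each

# Stacking the pieces fills the column missing from `one_row` with NA.
combined <- bind_rows(one_row, two_rows)
print(combined)
```

Once every d has been turned into a data frame this way, the final binding across elements is an ordinary row bind.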

library(purrr)
library(dplyr)

l %>%
    map("d") %>%
    map_df(bind_rows)

This is also equivalent:

map_df(l, ~bind_rows(.x[["d"]] ) )

The result is as follows:

# A tibble: 12 x 5
      id          label                 link      score externalId
   <int>          <chr>                <chr>      <dbl>      <dbl>
 1     5           Utah          Asia/Anadyr -0.2110459         NA
 2     8 South Carolina       Pacific/Wallis  0.5265409  -6.743544
 3     9       Nebraska America/Scoresbysund  0.2508955  16.425747
 4     8        Georgia         America/Nome  0.5264941   7.915836
 5     2     Washington     America/Shiprock -0.5551864  15.068666
 6     6   North Dakota            Universal  1.0316830         NA
 7     1  New Hampshire      America/Cordoba  1.2158206   9.727642
 8     1         Alaska        Asia/Istanbul -0.2318326         NA
 9     4   Pennsylvania Africa/Dar_es_Salaam  0.5902453         NA
10     3       Delaware       Asia/Samarkand  0.6955771  15.236482
11     4   North Dakota      America/Tortola  1.0306027  -7.216669
12     9       Nebraska      America/Ojinaga -1.1139800  -8.451451
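In purrr 1.0.0 and later, map_df is superseded in favour of map() followed by list_rbind(). The sketch below shows the same three steps in that style, assuming purrr >= 1.0.0 and dplyr are installed; `l_small` is a hypothetical miniature of l, not the original data.

```r
library(purrr)  # assumes purrr >= 1.0.0, which provides list_rbind()
library(dplyr)

# Hypothetical miniature of `l`: names and values are illustrative only.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

result <- l_small %>%
  map("d") %>%        # step 1: pull out `d` from each element
  map(bind_rows) %>%  # step 2: bind rows within each `d`
  list_rbind()        # step 3: bind everything into one data frame

print(result)
```

The map_df version above still works; this form only makes the final binding step explicit.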
