How to reference a column in a nested dataframe (then use purrr::map)


Question

I have a very simple question about referencing data columns within a nested dataframe.

For a reproducible example, I'll nest mtcars by the two values of the variable am:

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

This gives the following data:

#> # A tibble: 2 x 2
#> # Groups:   am [2]
#>      am data              
#>   <dbl> <list>            
#> 1     1 <tibble [13 × 10]>
#> 2     0 <tibble [19 × 10]>

If I now want to use purrr::map to take the mean of mpg for each level of am, I wonder why this doesn't work:


take_mean_mpg <- function(df){
  mean(df[["data"]]$mpg)
}

map(mtcars_nested, take_mean_mpg)



Error in df[["data"]] : subscript out of bounds

Or maybe a simpler question is: how should I properly reference the mpg column once it's nested? I know that this doesn't work:

mtcars_nested[["data"]]$mpg


Recommended answer

Dataframes (and tbls) are lists of columns, not lists of rows, so when you pass the whole tbl mtcars_nested to map(), it iterates over the columns, not over the rows. You can use mutate with your function, and map_dbl so that your new column is not a list column.

library(tidyverse)
mtcars_nested <- mtcars %>% 
  group_by(am) %>% 
  nest()
mtcars_nested

take_mean_mpg <- function(df){
  mean(df$mpg)
}

mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

The .data[["data"]] argument to map_dbl() gives it the data list column from your dataframe to iterate over, rather than the entire dataframe. The .data part of the argument has no relation to your column named "data"; it is the rlang pronoun .data, which refers to your whole dataframe. [["data"]] then retrieves the column named "data" from that dataframe. You use mutate because you are trying (I assumed, perhaps incorrectly) to add a column with the averages to the nested dataframe. mutate() is used to add columns, so you add a column equal to the output of map() (or map_dbl()) applied with your function, which returns the list (or vector) of averages.
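As a minimal sketch of the point about .data, assuming a recent tidyverse (dplyr, tidyr, purrr), the following compares the pronoun form with referring to the list column by its bare name, and shows one way to reach mpg directly outside a pipeline. The names with_pronoun, with_bare_name, and first_group_mpg are illustrative only and do not come from the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

take_mean_mpg <- function(df) {
  mean(df$mpg)
}

# Pronoun form from the answer: .data[["data"]] is the list column named "data"
with_pronoun <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(.data[["data"]], take_mean_mpg))

# Inside mutate() the bare column name refers to the same list column
with_bare_name <- mtcars_nested %>%
  mutate(mean_mpg = map_dbl(data, take_mean_mpg))

# Outside a pipeline, [["data"]] is a list of tibbles, so pick one element
# with [[ ]] before using $mpg
first_group_mpg <- mtcars_nested[["data"]][[1]]$mpg
```

This also suggests why mtcars_nested[["data"]]$mpg fails: the list column holds one tibble per group, and the list itself has no element named mpg.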

This can be a confusing concept. Although map() is often used to iterate over the rows of a dataframe, it technically iterates over a list. See the documentation, where the arguments section says:


.x: A list or atomic vector.

It also returns a list or a vector. The good news is that columns are just lists of values, so you pass it the list (column) you want it to iterate over and assign the result to the list (column) where you want it stored (this assignment happens with mutate()).
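The difference between mapping over the whole tibble and mapping over one of its columns can be checked directly. This is a small sketch under the same assumptions as the answer's code (dplyr, tidyr, and purrr loaded); column_names and mean_mpg are hypothetical names chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars_nested <- mtcars %>%
  group_by(am) %>%
  nest()

# Mapping over the whole tibble visits each column in turn: "am", then "data"
column_names <- names(map(mtcars_nested, class))

# Mapping over the list column visits one nested tibble per group
mean_mpg <- map_dbl(mtcars_nested$data, function(df) mean(df$mpg))
```

Because mean_mpg has one value per row of mtcars_nested, it is the right length to be stored as a regular column, which is what the mutate() call does.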
