When to use map() and when to use summarise_at()/mutate_at()


Question

Can anyone suggest when to use map() (and the other map_*() functions) and when to use summarise_at()/mutate_at()?

For example, if we are only modifying columns that are plain vectors, is there no need to think about map()? And if a data frame has a column that is a list, do we then need map()?

Does map() always have to be used together with nest()? Could anyone suggest some learning videos on this? Also, how do you put lists into a data frame, model several of them at the same time, and store the model results in another column?

Thanks a lot!

Recommended answer

The biggest difference between {dplyr} and {purrr} is that {dplyr} is designed to work on data.frames only, while {purrr} is designed to work on any kind of list. Since data.frames are lists, you can also use {purrr} to iterate over a data.frame, column by column:

library(purrr)
map_chr(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

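To make the "data.frames are lists" point concrete, here is a minimal sketch using only base R and {purrr}. The variable name n_distinct_values is made up for the illustration.

```r
library(purrr)

# A data.frame is a list whose elements are its columns
is.list(iris)
length(iris)   # one element per column

# Typed variants such as map_int() return an atomic vector instead of a list
n_distinct_values <- map_int(iris, ~ length(unique(.x)))
n_distinct_values
```

Each call to the anonymous function receives one column as .x, which is why no nest() is needed for this kind of iteration.
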
summarise_at() and map_at() do not behave exactly the same: summarise_at() returns only the summary you asked for, whereas map_at() returns the whole data.frame as a list, with the modification applied only where you asked for it:

> library(purrr)
> library(dplyr)
> small_iris <- sample_n(iris, 5)
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
$Sepal.Length
[1] 6.58

$Sepal.Width
[1] 3.2

$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7

$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5

$Species
[1] virginica  setosa     virginica  versicolor versicolor
Levels: setosa versicolor virginica

> summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
  Sepal.Length Sepal.Width
1         6.58         3.2

map_at() always returns a list; mutate_at() always returns a data.frame:

> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
$Sepal.Length
[1] 0.77 0.54 0.67 0.64 0.67

$Sepal.Width
[1] 0.28 0.39 0.33 0.29 0.31

$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7

$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5

$Species
[1] virginica  setosa     virginica  versicolor versicolor
Levels: setosa versicolor virginica

> mutate_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1         0.77        0.28          6.7         2.0  virginica
2         0.54        0.39          1.3         0.4     setosa
3         0.67        0.33          5.7         2.1  virginica
4         0.64        0.29          4.3         1.3 versicolor
5         0.67        0.31          4.7         1.5 versicolor
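
If you like the {purrr} syntax but want a data.frame back, one option (not part of the comparison above) is modify_at(), which keeps the type of its input. The sketch below uses head(iris, 5) rather than a random sample so that it is reproducible; the names small, scaled and scaled_again are illustrative only.

```r
library(purrr)

small <- head(iris, 5)

# modify_at() preserves the input type, so a data.frame stays a data.frame
scaled <- modify_at(small, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
class(scaled)

# Alternatively, convert the list returned by map_at() back explicitly
scaled_again <- as.data.frame(
  map_at(small, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
)
```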

So, to sum up on your first question: if you want to do a column-wise operation on a non-nested df and get a data.frame as the result, go for {dplyr}.
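
As a side note, in dplyr 1.0.0 and later the scoped verbs summarise_at() and mutate_at() are superseded by across(). They still work, but the same column-wise operations can be sketched as follows, assuming a recent {dplyr} is installed (small, means and scaled are illustrative names):

```r
library(dplyr)

small <- head(iris, 5)

# Equivalent of summarise_at(small, c("Sepal.Length", "Sepal.Width"), mean)
means <- summarise(small, across(c(Sepal.Length, Sepal.Width), mean))

# Equivalent of mutate_at(small, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
scaled <- mutate(small, across(c(Sepal.Length, Sepal.Width), ~ .x / 10))
```

Both calls return a data.frame, just like the scoped verbs they replace.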

Regarding nested columns, you have to combine group_by() and mutate() from {dplyr}, nest() from {tidyr}, and map(). What you are doing here is creating a smaller version of your data.frame that contains a column which is a list of data.frames. You then use map() to iterate over the elements of this new column.

Here is an example with our beloved iris:

library(tidyr)

iris_n <- iris %>% 
  group_by(Species) %>% 
  nest()
iris_n
# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

Here, the new object is a data.frame whose data column is a list of smaller data.frames, one per Species (the factor we specified in group_by()). We can then iterate over this column simply by doing:

map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))
[[1]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     2.6390       0.6905  


[[2]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     3.5397       0.8651  


[[3]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     3.9068       0.9015  
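
Note that map() itself does not depend on nest(): it works on any list. As a sketch of the same per-species models without a nested data.frame, base split() can produce the list directly (by_species, models and slopes are names chosen for this example):

```r
library(purrr)

# split() returns a plain named list of data.frames, one per species
by_species <- split(iris, iris$Species)

models <- map(by_species, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))

# Pull the slope out of each model as a named numeric vector
slopes <- map_dbl(models, ~ coef(.x)[["Sepal.Width"]])
slopes
```

The nested approach is mainly useful when you want the data, the models and their summaries to live side by side in one data.frame.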

But the idea is to keep everything inside a data.frame, so we can use mutate() to create a column that holds this new list of lm results:

iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
  Species    data              lm      
  <fct>      <list>            <list>  
1 setosa     <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica  <tibble [50 × 4]> <S3: lm>

You can then chain several steps inside mutate() to get, for example, the r.squared:

iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)), 
         lm = map(lm, summary), 
         r_squared = map_dbl(lm, "r.squared")) 
# A tibble: 3 x 4
  Species    data              lm               r_squared
  <fct>      <list>            <list>               <dbl>
1 setosa     <tibble [50 × 4]> <S3: summary.lm>     0.551
2 versicolor <tibble [50 × 4]> <S3: summary.lm>     0.277
3 virginica  <tibble [50 × 4]> <S3: summary.lm>     0.209
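
The string passed to map_dbl() above is a {purrr} shorthand: a character value is turned into an extractor function that pulls out the element of that name. A small sketch of the equivalence, using two arbitrary models on iris (the list fits and its names a and b are made up for the illustration):

```r
library(purrr)

fits <- list(
  a = summary(lm(Sepal.Length ~ Sepal.Width, data = iris)),
  b = summary(lm(Petal.Length ~ Petal.Width, data = iris))
)

# Extract by name ...
by_name <- map_dbl(fits, "r.squared")

# ... which should match the explicit formula version
by_formula <- map_dbl(fits, ~ .x[["r.squared"]])

all.equal(by_name, by_formula)
```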

But a more concise way is to use compose() from {purrr} to build a single function that does it all at once, instead of repeating the step inside mutate():

get_rsquared <- compose(as_mapper("r.squared"), summary, lm)

iris_n %>%
  mutate(lm = map_dbl(data, ~ get_rsquared(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
  Species    data                 lm
  <fct>      <list>            <dbl>
1 setosa     <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica  <tibble [50 × 4]> 0.209
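
The order of the arguments to compose() matters: by default the functions are applied from right to left, which is why lm comes last in the call above. A toy sketch with two hypothetical helpers, add_one() and times_two(), assuming purrr 0.3.0 or later for the .dir argument:

```r
library(purrr)

add_one   <- function(x) x + 1
times_two <- function(x) x * 2

# Default direction is right to left: times_two() runs first, then add_one()
right_to_left <- compose(add_one, times_two)
right_to_left(3)   # add_one(times_two(3)) = 7

# .dir = "forward" applies the functions left to right instead
left_to_right <- compose(add_one, times_two, .dir = "forward")
left_to_right(3)   # times_two(add_one(3)) = 8
```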

If you know you'll always be using Sepal.Length ~ Sepal.Width, you can even prefill lm() with partial():

pr_lm <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
get_rsquared <- compose(as_mapper("r.squared"), summary, pr_lm)

iris_n %>%
  mutate(lm = map_dbl(data, get_rsquared))
# A tibble: 3 x 3
  Species    data                 lm
  <fct>      <list>            <dbl>
1 setosa     <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica  <tibble [50 × 4]> 0.209

Regarding resources, I've written a series of blog posts on {purrr} that you can check: https://colinfay.me/tags/#purrr
