When to use the do function in dplyr

Question

I've learned that the do function is used when you want to apply a function to each group.

For example, if I want to pull the top 2 rows from the "A", "C", and "I" categories of the variable Index, the following syntax can be used.

t <- mydata %>% filter(Index %in% c("A", "C", "I")) %>% group_by(Index) %>% do(head(.,2))

I understand that after grouping by Index, the do function is used to compute head(., 2) for each group.

However, on some occasions do is not used at all. For example, to compute the mean of the variable Y2014 grouped by the variable Index, I thought that the following code should be used.

t <- mydata %>% group_by(Index) %>% do(summarise(Mean_2014 = mean(Y2014)))

However, the above syntax returns an error:

Error in mean(Y2014) : object 'Y2014' not found

But if I remove do from the syntax, it returns exactly what I wanted.

t <- mydata %>% group_by(Index) %>% summarise(Mean_2014 = mean(Y2014))

I'm really confused about the usage of the do function in dplyr. It seems inconsistent to me. When should I use the do function, and when should I not? Why should I use do in the first case and not in the second?

Recommended answer

The comments under the question point out that in many cases you can find an alternative in dplyr or associated packages that avoids the use of do, and the examples in the question are of that sort. However, to answer the question directly rather than via alternatives:

Within the context of data frames, the key differences between using do and not using do are:


  1. No automatic insertion of dot. The code within do will not have the dot automatically inserted as the first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with an explicit dot. This is a consequence of do, rather than summarise, being the right-hand-side function of %>%. It is important to understand this so that we write the dot ourselves when it is needed. That said, if the purpose were simply to avoid automatic insertion of the dot into the first argument, we could alternatively use braces to get that effect: whatever %>% { myfun(arg1, arg2) } would also not insert the dot as the first argument of the myfun call.

  2. Respecting group_by. There are two issues here. (1) Only functions specifically written to respect group_by will be run once for each group; mutate, summarize and do are examples of such functions (there are others too). (2) Even if the function is run once for each group, there is the question of how the dot is handled. We focus on two cases (not a complete list): (i) if do is not used, a dot that appears in a function call nested inside an argument expression refers to the entire input, ignoring group_by. Presumably this is a consequence of magrittr's dot substitution rules, since magrittr knows nothing about group_by. On the other hand, (ii) within do the dot always refers to the rows of the current group. For example, compare the output of the two calls below: the dot refers to 3 rows in the first case, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.

BOD$g <- c(1, 1, 1, 2, 2, 2)
BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
## # A tibble: 2 x 2
## # Groups: g [2]
##       g    nr
##   <dbl> <int>
## 1  1.00     3
## 2  2.00     3

BOD %>% group_by(g) %>% summarize(nr = nrow(.))
## # A tibble: 2 x 2
##       g    nr
##   <dbl> <int>
## 1  1.00     6
## 2  2.00     6
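
As a complement to the comparison above, here is a minimal sketch of how a per-group row count can be obtained without do. It assumes that dplyr's n() helper is acceptable in place of nrow(.); n() is not mentioned in the original answer, and bod is a local copy of the built-in BOD data set introduced here only for illustration.

```r
library(dplyr)

# Local copy of the built-in BOD data set (6 rows) with a grouping column
bod <- BOD
bod$g <- c(1, 1, 1, 2, 2, 2)

# n() is evaluated by summarize() once per group, so it counts the rows
# of the current group rather than the rows of the whole piped input
per_group <- bod %>% group_by(g) %>% summarize(nr = n())

# nrow(.) in the same position sees the dot bound by the pipe,
# which is the entire grouped input
whole_input <- bod %>% group_by(g) %>% summarize(nr = nrow(.))
```

The difference comes from who resolves the symbol: n() is resolved by summarize for each group, while the dot is resolved by the pipe before any grouping logic runs.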


See ?do for more information.
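
Returning to the first point, the effect of braces on dot insertion can be sketched in isolation. The count_args helper below is hypothetical and exists only to make the number of received arguments visible; it is not part of dplyr or magrittr.

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

# Hypothetical helper for illustration: returns how many arguments it received
count_args <- function(...) length(list(...))

# Plain call on the right-hand side: the left-hand side is inserted as the
# first argument, so the helper receives three arguments
n_plain <- "lhs" %>% count_args("a", "b")

# Call wrapped in braces: nothing is inserted automatically,
# so the helper receives only the two arguments written out
n_braced <- "lhs" %>% { count_args("a", "b") }

# Inside braces the dot is still available if written explicitly
n_explicit <- "lhs" %>% { count_args(., "a", "b") }
```

The same reasoning applies to do: the expression inside it is not the right-hand side of the pipe, so the dot has to be written explicitly wherever the data is needed.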

Now we go through the code in the question. As mydata was never defined in the question, we use the first line of code below to define it, to make the examples concrete.

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

mydata %>% 
       filter(Index %in% c("A", "C", "I")) %>% 
       group_by(Index) %>% 
       do(head(., 2))

## # A tibble: 6 x 2
## # Groups: Index [3]
##   Index  Y2014
##   <fctr> <dbl>
## 1 A       1.00
## 2 A       1.00
## 3 C       1.00
## 4 C       1.00
## 5 I       1.00
## 6 I       1.00

The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted do, it would disregard group_by and produce only two rows, with the dot being regarded as the entire 9 rows of input rather than one group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
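
The three behaviours described above can be compared side by side in a short sketch. The answer does not name the dplyr alternative to head; slice(1:2) is used here as an assumption about what is meant, since slice is group-aware.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# head() applied per group through do(): 2 rows for each of the 3 groups
with_do <- mydata %>% group_by(Index) %>% do(head(., 2))

# head() without do(): called once on the whole input, so only 2 rows
without_do <- mydata %>% group_by(Index) %>% head(2)

# slice() respects group_by, so it needs no do() (assumed alternative)
with_slice <- mydata %>% group_by(Index) %>% slice(1:2)

# Row counts of the three results
c(nrow(with_do), nrow(without_do), nrow(with_slice))
```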

The following code from the question generates an error because dot insertion is not done within do, and so what ought to be the first argument of summarise, i.e. the data frame input, is missing:

mydata %>% 
       group_by(Index) %>% 
       do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found

If we remove the do in the above code, as in the last line of code in the question, then it works because the dot insertion is performed. Alternatively, if we add the dot, do(summarise(., Mean_2014 = mean(Y2014))), it would also work, although do seems superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.

mydata %>% 
       group_by(Index) %>% 
       summarise(Mean_2014 = mean(Y2014))

## # A tibble: 3 x 2
##   Index  Mean_2014
##   <fctr>     <dbl>
## 1 A           1.00
## 2 C           1.00
## 3 I           1.00
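
To check that the two working variants agree, here is a small sketch. It assumes a variant of mydata in which Y2014 takes the values 1 to 9 instead of a constant 1, so that the group means differ; that change is made only for illustration.

```r
library(dplyr)

# Illustrative data: same Index layout as above, but Y2014 varies
mydata2 <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1:9)

# Explicit dot inside do(): summarise() is run on each group's rows
with_do <- mydata2 %>%
  group_by(Index) %>%
  do(summarise(., Mean_2014 = mean(Y2014)))

# No do(): the pipe inserts the grouped data as the first argument,
# and summarise() handles the grouping itself
without_do <- mydata2 %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))

# Both give one mean per group
all.equal(with_do$Mean_2014, without_do$Mean_2014)
```

Since both routes return the same means, the shorter form without do is the natural choice whenever the verb already respects group_by.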
