Using dplyr to summarize and keep the same variable name
Problem description
I have found that data.table and dplyr give different results when trying to do the same thing. I would like to use dplyr syntax, but have it compute the way data.table does. The use case is that I want to add subtotals to a table. To do that, I need to aggregate each variable but keep the same variable names (for the transformed versions). data.table lets me aggregate a variable under its existing name and then use that same variable in another aggregation: it continues to use the untransformed version. dplyr, however, uses the transformed version.
The summarise documentation says:
# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))
This is basically the issue I am running into, and I'm wondering if there is a nice workaround. One thing I found was to give the transformed variable a different name and rename it at the end, but that does not look very nice to me. If there is a nice way to do subtotals, that would be good to know as well. I looked around this site and did not see this exact situation discussed. Any help would be greatly appreciated!
Here is a simple example, once with data.table's results and once with dplyr's. I want to take this simple table and append a subtotal row that holds the weighted average of the column of interest (Total).
library(data.table)
library(dplyr)
dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]
dt <- rbind(dt[, .(Group, Count_Dist, Total)],
            dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")
dt
df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>%
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>%
  rename(Weighted_Avg_Total = Total)
Thanks again for your help!
Recommended answer
A possible solution is to skip the mutate steps: use transmute for the first mutate/select step, and calculate the desired variables directly from the original variables instead of creating an intermediate variable in the second mutate step:
df %>%
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
which gives:
  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748
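As a sanity check on the 656.0748 in the "All" row, the same weighted average can be reproduced without dplyr. The sketch below is only an illustration: the names w, manual and builtin are made up for this check, and it assumes the df defined in the question. It compares the hand-written sum with base R's weighted.mean(), which normalises the weights itself.

```r
# Same data as in the question.
df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

# Hypothetical helper names, used only for this check.
w <- df$Count / sum(df$Count)                  # share of each group in Count
manual <- sum(w * df$Total)                    # weighted average written out by hand
builtin <- weighted.mean(df$Total, df$Count)   # base R (stats) equivalent

print(all.equal(manual, builtin))
print(round(manual, 4))
```

If the two values agree, weighted.mean(Total, Count) could also be used inside summarize in place of the hand-written sum.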
Another possible solution is to change the order in which the new variables are calculated in dplyr, and then use select to restore the column order you originally wanted:
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>%
              select(Group, Count_Dist, Weighted_Avg_Total))
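To see why the order of the arguments matters, here is a small sketch based on the mtcars example from the summarise documentation quoted in the question. The object names overwritten and reordered are made up for illustration. In the first call disp has already been replaced by its group mean when sd() runs, so sd() sees a single value per group. In the second call sd() still sees the original column.

```r
library(dplyr)

# disp is overwritten first, so sd() only sees one value per group.
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# sd() is computed first, while disp is still the original column.
reordered <- mtcars %>%
  group_by(cyl) %>%
  summarise(sd = sd(disp), disp = mean(disp))

print(overwritten)
print(reordered)
```

The group means are the same in both results; only the sd column differs.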
If you want to include the Count column as well, you could do (based on my comment from below):
df %>%
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))