Using dplyr to summarize and keep the same variable name


Question

I have found that data.table and dplyr give different results when I try to do the same thing. I would like to use dplyr syntax but have it compute the way data.table does. My use case is adding subtotals to a table: I need to aggregate each variable and keep the same variable name for the aggregated version. data.table lets me aggregate a variable under its own name and then run another aggregation on that same variable, and the second aggregation still uses the untransformed values. dplyr, however, uses the transformed values.

The documentation for summarise contains this note:

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))
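
To make the effect of that note concrete, here is a minimal sketch that runs the documentation example and inspects the result. It assumes only that dplyr is installed; the name `overwritten` is introduced here for illustration.

```r
library(dplyr)

# Same call as in the documentation note: disp is replaced by its group mean
# first, so sd(disp) only sees one value per group.
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# The standard deviation of a single value is NA, so the sd column carries
# no information about the spread of the original disp values.
overwritten
```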

This is basically the issue I am running into, and I'm wondering whether there is a nice workaround. One option I found is to give the transformed variable a different name and rename it at the end, but that does not look very nice to me. If there is a good way to do subtotals, I'd like to know that as well. I looked around this site and did not see this exact situation discussed. Any help would be greatly appreciated!
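
For reference, the rename workaround mentioned above might look like the following sketch on the mtcars example. The temporary name `disp_mean` is hypothetical and chosen only for illustration.

```r
library(dplyr)

# Summarise under a temporary name (disp_mean, a hypothetical name) so that
# sd(disp) still sees the original column, then rename at the end.
renamed <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp_mean = mean(disp), sd = sd(disp)) %>%
  rename(disp = disp_mean)

renamed
```

It works, but the extra name and the trailing rename are the clutter the question is trying to avoid.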

Here is a simple example, once with data.table and once with dplyr. I want to take this simple table and append a subtotal row containing the weighted average of the column of interest (Total).

library(data.table)
library(dplyr)

dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
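# Add each group's share of the overall count
dt[, Count_Dist := Count/sum(Count)]
# The list in j is evaluated against the original columns, so Count_Dist*Total
# still uses the per-row shares even though Count_Dist is also an output name
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]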

dt <- rbind(dt[, .(Group, Count_Dist, Total)],
      dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")

dt

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>% 
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>% 
  rename(Weighted_Avg_Total = Total)
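
To see how far off the dplyr result is, the sketch below compares it with a reference value computed independently with base R's `weighted.mean`. The names `dplyr_result` and `expected` are introduced here for illustration only.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

# Count_Dist is overwritten by its sum (about 1) before Weighted_Total is
# evaluated, so Count_Dist * Total is effectively just Total.
dplyr_result <- df %>%
  mutate(Count_Dist = Count / sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist * Total))

# Reference: the count-weighted average of Total, computed with base R.
expected <- weighted.mean(df$Total, df$Count)

dplyr_result$Weighted_Total  # matches sum(df$Total) up to rounding
expected                     # the weighted average the subtotal row should hold
```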

Thanks again for your help!

Recommended answer

One possible solution is to replace the first mutate/select step with transmute and, in the summarize step, calculate the desired values directly from the original variables instead of creating an intermediate Count_Dist variable with mutate:

df %>% 
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))

This gives:

  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748

Another possible solution is to change the order in which the new variables are calculated in summarize, then use select to restore the column order you originally wanted:

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>% 
              select(Group, Count_Dist, Weighted_Avg_Total))
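
As a quick check, the following sketch runs both approaches on the example data and compares the results. The names `direct` and `reordered` are introduced here for illustration; the pipelines themselves are the ones shown above.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

# First approach: compute everything from the original columns.
direct <- df %>%
  transmute(Group, Count_Dist = Count / sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count / sum(Count)),
                        Weighted_Avg_Total = sum((Count / sum(Count)) * Total)))

# Second approach: compute the weighted total before Count_Dist is
# overwritten, then restore the column order with select.
reordered <- df %>%
  mutate(Count_Dist = Count / sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              mutate(Count_Dist = Count / sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist * Total),
                        Count_Dist = sum(Count_Dist)) %>%
              select(Group, Count_Dist, Weighted_Avg_Total))

# Both tables should agree, including the "All" subtotal row.
all.equal(direct, reordered)
```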

If you want to include the Count column as well, you could do the following (based on my comment below):

df %>% 
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))
