dplyr: colSums on sub-grouped (group_by) data frames: elegantly

Views: 160 Published: 2020/10/26 3:59:08 Tags: r, dplyr


Problem description

I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), a date (5,479 distinct values) and a geographic locality (4 in total). Each record consists of one value for each of these, plus 27 count variables. I want to group by all three grouping variables and then take colSums of the 27 count variables within each sub-group. I've been trying to do this with dplyr (v0.2), because doing it manually means setting up a lot of redundant code (or resorting to a loop over the grouping options, for lack of a more elegant solution).

Example code:

countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 2000, replace = TRUE)
ageCat <- sample(1:2, 2000, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))

Then what I'd like to do is:

library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))

but this doesn't quite work, because colSums() returns a named numeric vector rather than a data frame. If I wrap the result in a data frame, it works:

sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
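The difference between the two attempts comes down to the return type of colSums(). A minimal base R sketch of that point, using a small made-up data frame (the names toy, sums and oneRow are illustrative only and not part of the question). Transposing the vector before converting keeps the original column names, whereas going through matrix() drops them and leaves the default names X1, X2, and so on:

```r
# Toy data standing in for one group's count columns (hypothetical values)
toy <- data.frame(X1 = c(1, 2, 3), X2 = c(10, 20, 30))

# colSums() returns a named numeric vector, not a data frame
sums <- colSums(toy)
is.data.frame(sums)

# t() turns the vector into a 1 x n matrix that keeps the names as column names,
# so as.data.frame() yields a one-row data frame with the original names
oneRow <- as.data.frame(t(sums))
oneRow
```

A one-row data frame like oneRow is the shape that do() can stack across groups.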

but the final do(...) step seems very clunky.

Any thoughts on how to do this more elegantly or efficiently? I guess the question comes down to: how best to use the do() function and the . placeholder to summarize a data frame via colSums.

Note: the do(.) form only applies to dplyr 0.2, so you need to grab it from GitHub (link), not from CRAN.

Edit: results for the suggestions

Three solutions:

My suggestion in the post: 146.765 seconds elapsed.

@joran's suggestion below: 6.902 seconds.

@eddi's suggestion in the comments, using data.table: 6.715 seconds.

I didn't bother to replicate the timings; I just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both, when used properly, are significantly faster than the hack solution I came up with yesterday.
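For readers who want to take the same kind of rough gauge, here is a sketch of a system.time() comparison in base R only. It does not reproduce the dplyr or data.table solutions timed above. As stand-ins it times a split-then-colSums() loop against rowsum(), on a small made-up data set (df, countCols and the seed are assumptions for illustration), and it makes no claim about which is faster on real data:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 200
df <- data.frame(
  locality = sample(1:2, n, replace = TRUE),
  ageCat   = sample(1:2, n, replace = TRUE),
  matrix(sample(0:10, n * 10, replace = TRUE), nrow = n, ncol = 10)
)
countCols <- setdiff(names(df), c("locality", "ageCat"))

# Approach A: split by group, then colSums() on each piece
timeSplit <- system.time({
  pieces  <- split(df[countCols], list(df$locality, df$ageCat), drop = TRUE)
  bySplit <- do.call(rbind, lapply(pieces, colSums))
})

# Approach B: rowsum() adds up all rows that share a group key in one call
timeRowsum <- system.time({
  key      <- interaction(df$locality, df$ageCat, drop = TRUE)
  byRowsum <- rowsum(df[countCols], key)
})

# Elapsed wall-clock seconds for each approach (a single run, so only a rough gauge)
timeSplit["elapsed"]
timeRowsum["elapsed"]
```

A single system.time() call is noisy, so differences of a few percent (such as 6.902 versus 6.715 seconds) should not be read as meaningful without repeated runs.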

Recommended answer

Unless I'm missing something, this seems like a job for summarise_each (a sort of analogue of colwise from plyr):

sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))


The grouping columns are not included in the summarizing function by default, and you can restrict the function to a subset of columns using the same techniques you would use with select.

(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
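In later dplyr releases the `%.%` operator was dropped in favour of `%>%`, and summarise_each() was deprecated. The same idea is now usually written with across(). A minimal sketch, assuming dplyr 1.0.0 or later and a sampleDF laid out like the one in the question (rebuilt here with 200 rows in every column so the block is self-contained; the seed and the names allSums and someSums are illustrative assumptions):

```r
library(dplyr)

set.seed(42)  # arbitrary seed for a reproducible sketch
sampleDF <- data.frame(
  dates    = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-30"), by = "days"),
                    200, replace = TRUE),
  locality = sample(1:2, 200, replace = TRUE),
  ageCat   = sample(1:2, 200, replace = TRUE),
  matrix(sample(0:10, 2000, replace = TRUE), nrow = 200, ncol = 10)
)

# Grouping columns are excluded from across() automatically,
# so everything() selects only the ten count columns X1..X10
allSums <- sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop")

# Restrict the summary to a subset of columns with select()-style syntax
someSums <- sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(X1:X3, sum), .groups = "drop")
```

Both results have one row per (locality, ageCat, dates) combination, with the grouping columns first and the summed count columns after them.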
