How to use R for same-field aggregation by multiple separate groups
Question
I'm trying to count an indicator over several (actually hundreds of) groupings separately, NOT over all combinations of all groups. I'll demonstrate with a simplified example:
Suppose I have this dataset:
data<-cbind(c(1,1,1,2,2,2)
,c(1,1,2,2,2,3)
,c(3,2,1,2,2,3))
> data
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 1 2
[3,] 1 2 1
[4,] 2 2 2
[5,] 2 2 2
[6,] 2 3 3
and an indicator:
some_indicator<-c(1,0,0,1,0,1)
Then I want to run, without loops (e.g. with an apply over columns), something like:
aggregate(some_indicator,list(data[,1]),sum)
aggregate(some_indicator,list(data[,2]),sum)
aggregate(some_indicator,list(data[,3]),sum)
which would produce the following result:
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 2 1 1
[3,] 0 1 2
That is, for each column (the set of values does not change much between columns), sum the indicator by value and merge the results.
Currently I do this with a loop over the columns, but I need a much more efficient way, since there are a lot of columns and it takes over an hour.
Thanks in advance, Michael.
Recommended answer
1) tapply The first argument of tapply is data with each column replaced by some_indicator. The second argument indicates that we wish to group by the values in data and by the column number.
result <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
replace(unname(result), is.na(result), 0)
For the input shown in the question, the last line gives:
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 2 1 1
[3,] 0 1 2
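To make the two arguments of tapply easier to follow, here is a minimal sketch that builds them as separate intermediate objects. The names values and column_id are introduced here purely for illustration and do not appear in the answer; the data and indicator are copied from the question.

```r
# Data and indicator copied from the question.
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Recycle the indicator down every column: a 6 x 3 matrix whose
# columns are all equal to some_indicator (hypothetical name).
values <- replace(data, TRUE, some_indicator)

# col(data) has the same shape as data and holds each cell's column number.
column_id <- col(data)

# Group every cell by (value in data, column number) and sum the indicator.
result <- tapply(values, list(data, column_id), sum)

# Combinations that never occur come back as NA; report them as 0.
result <- replace(unname(result), is.na(result), 0)
print(result)
```

Because all columns are processed in one tapply call, there is no explicit loop over columns; whether this is fast enough for hundreds of columns would need to be measured on the real data.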
1a) tapply A somewhat longer tapply solution is the following. fun takes a column as its argument and uses tapply to sum some_indicator using that column as the grouping variable. However, different columns could contain different sets of groups, so to ensure that they all have the same set of groups (for later alignment) we actually group by factor(x, levs). The sapply applies fun to each column of data. The as.data.frame is needed because data is a matrix, so sapply applied to it directly would iterate over each element rather than each column.
levs <- levels(factor(data))
fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
result <- sapply(as.data.frame(data), fun)
replace(unname(result), is.na(result), 0)
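The point about as.data.frame can be checked directly. The sketch below compares what sapply iterates over with and without the conversion, and also tries apply(data, 2, fun) as one possible alternative for matrix columns; that alternative is not part of the answer and is shown only as an assumption worth testing. The names per_cell, per_column, via_sapply and via_apply are illustrative.

```r
# Data and indicator copied from the question.
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# On the matrix itself, sapply visits all 18 cells one at a time.
per_cell <- sapply(data, length)

# On the data frame, sapply passes each of the 3 columns as a whole vector.
per_column <- sapply(as.data.frame(data), length)

# fun and levs as defined in the answer.
levs <- levels(factor(data))
fun <- function(x) tapply(some_indicator, factor(x, levs), sum)

via_sapply <- sapply(as.data.frame(data), fun)

# Possible alternative (not from the answer): apply over matrix columns.
via_apply <- apply(data, 2, fun)

# Apart from column names, both approaches should give the same matrix.
print(isTRUE(all.equal(unname(via_sapply), unname(via_apply))))
```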
2) xtabs This is quite similar to the tapply solution. It has two advantages: (1) sum is implied by xtabs and so need not be specified, and (2) unfilled cells are filled with 0 rather than NA, which eliminates the extra step of replacing NAs with 0. On the other hand, we must unravel each component of the formula into a vector using c since, unlike tapply, the xtabs formula will not accept matrices:
result <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(result) <- NULL
For the data in the question this gives:
> result
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 2 1 1
[3,] 0 1 2
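The difference in how empty cells are reported can be seen by running both approaches side by side. This is a small sketch on the question's data; by_tapply, by_xtabs and values are illustrative names, not part of the answer.

```r
# Data and indicator copied from the question.
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Indicator recycled down every column (hypothetical name).
values <- replace(data, TRUE, some_indicator)

by_tapply <- tapply(values, list(data, col(data)), sum)
by_xtabs  <- xtabs(c(values) ~ c(data) + c(col(data)))

# Value 3 never occurs in column 1: tapply leaves NA there, xtabs gives 0.
print(by_tapply["3", "1"])
print(by_xtabs["3", "1"])

# After replacing NA with 0, the two results hold the same numbers.
tapply_filled <- replace(unname(by_tapply), is.na(by_tapply), 0)
print(all(as.vector(tapply_filled) == as.vector(by_xtabs)))
```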
REVISED: Revised the tapply solution and added the xtabs solution.