How to use R to aggregate the same field by multiple separate groups


Question

I'm trying to compute counts of an indicator over several (actually hundreds of) grouping variables separately, NOT over all combinations of all groups. I'll demonstrate with a simplified example:

Suppose I have this dataset:

data<-cbind(c(1,1,1,2,2,2)
,c(1,1,2,2,2,3)
,c(3,2,1,2,2,3))
> data

      [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    2
[3,]    1    2    1
[4,]    2    2    2
[5,]    2    2    2
[6,]    2    3    3

and an indicator:

some_indicator<-c(1,0,0,1,0,1)

Then I want to run, without loops (something like an apply by column), the equivalent of:

aggregate(some_indicator,list(data[,1]),sum)
aggregate(some_indicator,list(data[,2]),sum)
aggregate(some_indicator,list(data[,3]),sum)

which would produce the following result:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
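
For reference, here is a sketch of what a loop-based baseline producing that matrix might look like. The asker's actual loop is not shown, so this is only an assumed reconstruction; the names groups, result, and agg are illustrative.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# All group values that occur anywhere in data; these become the rows.
groups <- sort(unique(c(data)))

# Start from zeros so that groups absent from a column stay at 0.
result <- matrix(0, nrow = length(groups), ncol = ncol(data))

for (j in seq_len(ncol(data))) {
  # One aggregate() call per column, as in the question.
  agg <- aggregate(some_indicator, list(group = data[, j]), sum)
  # Place each per-group sum in the row belonging to that group.
  result[match(agg$group, groups), j] <- agg$x
}

print(result)
```

Each pass through the loop calls aggregate once, which is the per-column cost the question wants to avoid.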

That is, for each column (the set of values does not change much between columns), sum the indicator by value and merge the per-column results.

Currently I do this with a loop over the columns, but I need a much more efficient way, since there are a lot of columns and it takes over an hour.

Thanks in advance, Michael.

Recommended answer

1) tapply. The first argument of tapply is data with every column replaced by some_indicator. The second argument says that we wish to group by both the group values in data and the column number.

result <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
replace(unname(result), is.na(result), 0)

For the input shown in the question, the last line gives:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
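
To make the one-liner easier to follow, the sketch below splits it into its intermediate pieces. The names values and column_id are introduced here for illustration only; the logic is the same as in solution 1.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Values to be summed: some_indicator recycled into every column of data.
values <- replace(data, TRUE, some_indicator)

# Column number of every cell, in a matrix of the same shape as data.
column_id <- col(data)

# Sum values within each (group value, column number) pair.
result <- tapply(values, list(data, column_id), sum)

# Pairs that never occur come back as NA; turn them into 0 and drop dimnames.
result <- replace(unname(result), is.na(result), 0)
print(result)
```

Because every cell carries its own column number as a second grouping key, a single tapply call covers all columns at once.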

1a) tapply. A somewhat longer tapply solution is the following. fun takes a column as its argument and uses tapply to sum some_indicator using that column as the grouping variable. Different columns could contain different sets of groups, so to ensure that they all share the same set of groups (for later alignment) we actually group by factor(x, levs). The sapply call applies fun to each column of data. The as.data.frame is needed because data is a matrix, so sapply applied to it directly would iterate over each element rather than each column.

 levs <- levels(factor(data))
 fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
 result <- sapply(as.data.frame(data), fun)
 replace(unname(result), is.na(result), 0)
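
The point about as.data.frame can be checked directly, and apply over columns is one possible alternative to the conversion. This is a sketch rather than part of the original answer; result_apply is an illustrative name.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

levs <- levels(factor(data))
fun <- function(x) tapply(some_indicator, factor(x, levs), sum)

# sapply on the matrix visits every cell; on a data frame it visits columns.
n_cells <- length(sapply(data, identity))
n_columns <- length(as.data.frame(data))

# apply(..., 2, fun) also walks the columns, without converting the matrix.
result_apply <- apply(data, 2, fun)
result_apply <- replace(unname(result_apply), is.na(result_apply), 0)
print(result_apply)
```

Both routes still call tapply once per column, so they are closer to the original loop in cost than solution 1.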

2) xtabs. This is quite similar to the tapply solution. It has two advantages: (1) sum is implied by xtabs and so need not be specified, and (2) unfilled cells are filled with 0 rather than NA, which eliminates the extra step of replacing NAs with 0. On the other hand, we must unravel each component of the formula into a vector using c, since unlike tapply the xtabs formula will not accept matrices:

result <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(result) <- NULL

For the data in the question this gives:

> result
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
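
The same xtabs idea can also be written against an explicit long-format data frame, which some readers may find easier to follow than wrapping each term in c. This is only a sketch; the names long, value, group, column, and result_matrix are assumptions made for illustration.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# One row per cell of data: the indicator value, the group value in that
# cell, and the column the cell came from.
long <- data.frame(value  = rep(some_indicator, times = ncol(data)),
                   group  = c(data),
                   column = c(col(data)))

# xtabs sums value within each group/column pair; empty pairs become 0.
result <- xtabs(value ~ group + column, data = long)

# Strip the table class and dimnames to get a plain numeric matrix.
result_matrix <- matrix(result, nrow = nrow(result))
print(result_matrix)
```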

REVISED: Revised the tapply solution and added the xtabs solution.
