How to use R to aggregate the same field by multiple separate groups


Question

I'm trying to count an indicator over several (actually hundreds of) groupings separately, NOT over all combinations of all groups. I'll demonstrate with a simplified example:

Suppose I have this dataset:

data<-cbind(c(1,1,1,2,2,2)
,c(1,1,2,2,2,3)
,c(3,2,1,2,2,3))
> data

      [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    2
[3,]    1    2    1
[4,]    2    2    2
[5,]    2    2    2
[6,]    2    3    3

and an indicator:

some_indicator<-c(1,0,0,1,0,1)

Then I want to run, without loops (something like an apply over columns), the equivalent of:

aggregate(some_indicator,list(data[,1]),sum)
aggregate(some_indicator,list(data[,2]),sum)
aggregate(some_indicator,list(data[,3]),sum)

This should yield the following result:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2

That is, for each column (the set of values does not vary much between columns), sum the indicator by value and merge the results.
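For reference, here is a small sketch of what a single one of these aggregate calls returns on the example data. It shows why the per-column results need aligning before they can be merged: column 1 never contains the value 3, so its result has only two rows, while column 3 yields three. The names by_col1 and by_col3 are introduced here for illustration only.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# Sum the indicator within the groups defined by column 1.
by_col1 <- aggregate(some_indicator, list(data[, 1]), sum)
print(by_col1)
#   Group.1 x
# 1       1 1
# 2       2 2

# Column 3 contains all three values, so its result has three rows.
by_col3 <- aggregate(some_indicator, list(data[, 3]), sum)
print(nrow(by_col3))  # 3
```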

Currently I do this with a loop over the columns, but I need a much more efficient way, since there are a lot of columns and it takes over an hour.

Thanks in advance, Michael.

Recommended answer

1) tapply. The first argument to tapply is data with each column replaced by some_indicator. The second argument says that we wish to group by the values in data and by the column number.

result <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
replace(unname(result), is.na(result), 0)

For the input shown in the question, the last line gives:

     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
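To see why the one-liner works, it may help to look at its two building blocks separately. The sketch below uses the question's data; the names values and column_id are illustrative and do not appear in the answer.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# replace() with index TRUE overwrites every cell; some_indicator is recycled
# column by column, so each column of `values` is a copy of it.
values <- replace(data, TRUE, some_indicator)
print(values[, 1])     # 1 0 0 1 0 1

# col() returns a matrix of the same shape holding each cell's column number,
# which serves as the second grouping key in tapply.
column_id <- col(data)
print(column_id[1, ])  # 1 2 3
```

Each of the 18 cells therefore carries a value to sum, a group label (from data) and a column number, and tapply sums over every (group, column) pair that occurs.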

1a) tapply. A somewhat longer tapply solution is the following. fun takes a column as its argument and uses tapply to sum some_indicator using that column as the grouping variable. However, different columns could contain different sets of groups, so to ensure that they all have the same set of groups (for later alignment) we actually group by factor(x, levs). The sapply call applies fun to each column of data. The as.data.frame is needed because data is a matrix, and sapply on a matrix would iterate over each element rather than each column.

 levs <- levels(factor(data))
 fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
 result <- sapply(as.data.frame(data), fun)
 replace(unname(result), is.na(result), 0)
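The two details in this variant, the as.data.frame conversion and the shared factor levels, can be checked in isolation. This is a small sketch on the question's data; per_cell, per_column, unaligned and aligned are names made up for the illustration.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# On a matrix, sapply visits each of the 18 cells individually.
per_cell <- sapply(data, length)
print(length(per_cell))  # 18

# On a data frame, sapply visits each of the 3 columns (named V1, V2, V3).
per_column <- sapply(as.data.frame(data), length)
print(per_column)        # V1 V2 V3 -> 6 6 6

# Without shared levels, column 1 yields only two groups.
unaligned <- tapply(some_indicator, data[, 1], sum)
print(length(unaligned)) # 2

# With shared levels, group 3 is present (as NA), so columns line up row by row.
levs <- levels(factor(data))
aligned <- tapply(some_indicator, factor(data[, 1], levs), sum)
print(length(aligned))   # 3
```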

2) xtabs. This is quite similar to the tapply solution. It has two advantages: (1) the sum is implied by xtabs and so need not be specified, and (2) empty cells are filled with 0 rather than NA, eliminating the extra step of replacing NAs with 0. On the other hand, we must flatten each component of the formula into a vector using c, since unlike tapply the xtabs formula will not accept matrices:

result <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(result) <- NULL

For the data in the question this gives:

> result
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    1
[3,]    0    1    2
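As a sanity check, the three approaches can be run side by side on the question's data and compared against the expected table. This is only a sketch; the names r1, r1a, r2 and expected are introduced here and are not part of the answer.

```r
data <- cbind(c(1, 1, 1, 2, 2, 2),
              c(1, 1, 2, 2, 2, 3),
              c(3, 2, 1, 2, 2, 3))
some_indicator <- c(1, 0, 0, 1, 0, 1)

# 1) tapply over all cells at once, then NA -> 0
r1 <- tapply(replace(data, TRUE, some_indicator), list(data, col(data)), sum)
r1 <- replace(unname(r1), is.na(r1), 0)

# 1a) tapply per column with shared factor levels, then NA -> 0
levs <- levels(factor(data))
fun <- function(x) tapply(some_indicator, factor(x, levs), sum)
r1a <- sapply(as.data.frame(data), fun)
r1a <- replace(unname(r1a), is.na(r1a), 0)

# 2) xtabs, which sums the left-hand side and fills empty cells with 0
r2 <- xtabs(c(replace(data, TRUE, some_indicator)) ~ c(data) + c(col(data)))
dimnames(r2) <- NULL

# The table shown in the question, rows = group value, columns = data column
expected <- rbind(c(1, 1, 0),
                  c(2, 1, 1),
                  c(0, 1, 2))

stopifnot(all(as.vector(r1)  == as.vector(expected)),
          all(as.vector(r1a) == as.vector(expected)),
          all(as.vector(r2)  == as.vector(expected)))
```

If stopifnot returns silently, all three results match the table from the question.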

REVISED: Revised the tapply solution and added the xtabs solution.
