Sum multiple columns by group with tapply
Question
I wanted to sum individual columns by group, and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?
I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions has been answered directly. Instead, the responses invariably suggest using a different function.
Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to tapply that do work.
At the bottom I show a simple modification to the tapply source code that allows tapply to perform the desired operation.
Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.
Given the simplicity of my modification to the tapply source code, I wonder why it, or something similar, has not already been implemented.
Thank you for any advice. If my question is a duplicate, I will be happy to post my question as an answer to that other question.
Here is the example data set:
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)
This does not work:
tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})
The help page says:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X an atomic object, typically a vector.
I was confused by the phrase "typically a vector", which made me wonder whether a data frame could be used. I have never been clear on what "atomic object" means.
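As a minimal sketch of what "atomic" means here, the checks below use only base R functions (is.atomic, is.list, length, nrow) on the example data. A column of the data frame is an atomic vector, whereas the data frame itself is a list of columns. Its length is therefore the number of columns, not the number of rows. In the version of the tapply source quoted further down, nx <- length(X) is compared with the length of each index, which would explain an "arguments must have same length" error for this input.

```r
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

is.atomic(df.1$apples)  # TRUE: a single column is an atomic (integer) vector
is.atomic(df.1)         # FALSE: a data frame is not atomic
is.list(df.1)           # TRUE: it is a list of columns

length(df.1)            # 5, the number of columns
nrow(df.1)              # 6, the number of rows
length(df.1$state)      # 6, the length of the grouping vector
```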
Here are several alternatives to tapply that do work. The first alternative is a work-around that combines tapply with apply.
apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), colSums))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
aggregate(df.1[,3:5], by=list(df.1$state), sum)
# Group.1 apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
by(df.1[,3:5], df.1$state, colSums)
# df.1$state: AA
# apples cherries plums
# 111 222 333
# ------------------------------------------------------------
# df.1$state: BB
# apples cherries plums
# -111 -222 -333
with(df.1,
     aggregate(x = list(apples   = apples,
                        cherries = cherries,
                        plums    = plums),
               by = list(state = state),
               FUN = function(x) sum(x)))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )
# $AA
# apples cherries plums
# 111 222 333
#
# $BB
# apples cherries plums
# -111 -222 -333
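Since additional alternatives are welcome, here are three more base R options, offered as a sketch rather than a recommendation: rowsum, which has a data frame method; the formula interface of aggregate; and sapply over the columns with tapply applied to each one. The object names rs, ag and sp are only illustrative.

```r
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

# rowsum() sums the rows of each column within the groups given by 'group'
rs <- rowsum(df.1[, 3:5], group = df.1$state)

# Formula interface: cbind() on the left lists the columns to sum
ag <- aggregate(cbind(apples, cherries, plums) ~ state, data = df.1, FUN = sum)

# tapply() on one column at a time; sapply() binds the results into a matrix
sp <- sapply(df.1[3:5], function(col) tapply(col, df.1$state, sum))

print(rs)
print(ag)
print(sp)
```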
Here is the source code for tapply, except that I changed the line:
nx <- length(X)
to:
nx <- ifelse(is.vector(X), length(X), dim(X)[1])
This modified version of tapply performs the desired operation:
my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index) #- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
        ansmat <- array(dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
    } else {
        ansmat <- array(vector("list", prod(extent)),
                        dim = extent, dimnames = namelist)
    }
    if (length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}
my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})
# AA BB
# 111 -111
my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})
# $AA
# apples cherries
# 111 222
#
# $BB
# apples cherries
# -111 -222
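The following sketch isolates the two steps that appear to make the modified function work on a data frame. It is my reading of the code above, not a statement about how tapply is meant to be used. First, is.vector is FALSE for a data frame, so nx becomes the row count. Second, split called on a data frame splits it by rows, so each piece passed to FUN is itself a data frame. The plain if/else on is.data.frame is a suggested stand-in for the ifelse line, and the names X, group, pieces and res are illustrative.

```r
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

X <- df.1[, 3:4]

# Scalar test, so if/else is enough; gives the same value as the ifelse() line
nx <- if (is.data.frame(X)) nrow(X) else length(X)

# Integer group codes, as computed inside the loop for a single index
group <- as.integer(as.factor(df.1$state))

# split() on a data frame returns one data frame of rows per group
pieces <- split(X, group)
res <- lapply(pieces, colSums)
print(res)
```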
Recommended answer
tapply works on a vector; for a data.frame you can use by (which is a wrapper for tapply; take a look at the code):
> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
apples cherries plums
111 222 333
-------------------------------------------------------------------------------------
df.1$state: BB
apples cherries plums
-111 -222 -333
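To connect this back to the original question, the same idea can be applied with tapply directly by grouping the row numbers instead of the data frame, then using each group of row numbers to subset the columns. This is a sketch of that approach, not a quotation of the by source; the names sums and res are illustrative. Because each group returns three values, tapply returns a list-like array, which do.call(rbind, ...) then stacks into a matrix.

```r
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)

# X is the vector of row numbers; FUN receives the row numbers of one group
sums <- tapply(seq_len(nrow(df.1)), df.1$state,
               function(i) colSums(df.1[i, 3:5]))

# One named numeric vector per state; stack them into a matrix
res <- do.call(rbind, sums)
print(res)
```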