Sum multiple columns by group with tapply

Question

I wanted to sum individual columns by group and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?

I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.

Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to tapply that do work.

At the bottom I show a simple modification to the tapply source code that allows tapply to perform the desired operation.

Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.

Given the simplicity of my modification to the tapply source code I wonder why it, or something similar, has not already been implemented.

Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.

Here is the example data set:

df.1 <- read.table(text = '

    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300

', header = TRUE, stringsAsFactors = FALSE)

This does not work:

tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})

The help page says:

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X       an atomic object, typically a vector.

I was confused by the phrase "typically a vector", which made me wonder whether a data frame could be used. I have never been clear on what "atomic object" means.
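
For readers with the same question, a short sketch of what "atomic" means in practice may help. It rebuilds the example data with data.frame() (an equivalent construction assumed here so the snippet runs on its own) and checks how R classifies a single column versus the whole data frame. The length comparison at the end appears to be the check that the tapply source quoted in this question trips over; other R versions may behave differently.

```r
# Rebuild the example data (same values as df.1 above).
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

# A single column is an atomic vector; the data frame is a list of columns.
is.atomic(df.1$apples)  # TRUE
is.atomic(df.1)         # FALSE
is.list(df.1)           # TRUE

# length() of a data frame counts columns, not rows, so it does not
# match the length of the grouping vector.
length(df.1)            # 5 (columns)
nrow(df.1)              # 6 (rows)
length(df.1$state)      # 6
```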

Here are several alternatives to tapply that do work. The first alternative is a work-around that combines tapply with apply.

apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333
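
The workaround above hands tapply one column at a time, which is the shape of input it expects. A minimal sketch of the same idea with sapply, which iterates over the columns of a data frame directly; the data.frame() call only rebuilds the example data, and col.sums is a name chosen here for illustration.

```r
# Rebuild the example data (same values as df.1 above).
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

# Each column is an atomic vector, so tapply can sum it by state;
# sapply then binds the per-column results into a matrix.
col.sums <- sapply(df.1[, 3:5], function(col) tapply(col, df.1$state, sum))
col.sums
```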

with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), colSums))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

aggregate(df.1[,3:5], by=list(df.1$state), sum)

#   Group.1 apples cherries plums
# 1      AA    111      222   333
# 2      BB   -111     -222  -333

by(df.1[,3:5], df.1$state, colSums)

# df.1$state: AA
#   apples cherries    plums 
#      111      222      333 
# ------------------------------------------------------------ 
# df.1$state: BB
#   apples cherries    plums 
#     -111     -222     -333

with(df.1, 
     aggregate(x = list(apples   = apples, 
                        cherries = cherries,
                        plums    = plums), 
               by = list(state   = state), 
               FUN = function(x) sum(x)))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )

# $AA
#   apples cherries    plums 
#      111      222      333 
#
# $BB
#   apples cherries    plums 
#     -111     -222     -333

Here is the source code for tapply except that I changed the line:

nx <- length(X)

to:

nx <- ifelse(is.vector(X), length(X), dim(X)[1])
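
A possible variant of this one-line change, offered as an assumption rather than as part of the original modification: NROW() returns the length of a plain vector and the number of rows of a data frame or matrix, so it would cover both branches without ifelse(). It also avoids relying on is.vector(), which is FALSE for vectors that carry attributes other than names (a factor, for instance). The objects x, df and f below are illustrative only.

```r
x  <- c(1, 10, 100, -1, -10, -100)
df <- data.frame(apples = x, cherries = 2 * x)
f  <- factor(c("AA", "AA", "AA", "BB", "BB", "BB"))

# NROW() treats a vector as a one-column matrix.
NROW(x)       # 6
NROW(df)      # 6
NROW(f)       # 6

# is.vector() only accepts vectors with no attributes other than names.
is.vector(x)  # TRUE
is.vector(f)  # FALSE
```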

This modified version of tapply performs the desired operation:

my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx     <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index) #- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
        ansmat <- array(dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
    } else {
        ansmat <- array(vector("list", prod(extent)),
                        dim = extent, dimnames = namelist)
    }
    if(length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}

my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})

#  AA   BB 
# 111 -111

my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})

# $AA
#   apples cherries 
#      111      222 
#
# $BB
#   apples cherries 
#     -111     -222

Recommended answer

tapply works on a vector; for a data.frame you can use by (which is a wrapper for tapply; take a look at the code):

> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
  apples cherries    plums 
     111      222      333 
------------------------------------------------------------------------------------- 
df.1$state: BB
  apples cherries    plums 
    -111     -222     -333 
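
To illustrate the "wrapper" remark: as far as I can tell from the source of by.data.frame, it calls tapply on the row indices and lets the supplied function subset the data frame. A minimal sketch of that idea written out by hand, under that assumption (the data.frame() call just rebuilds the example data, and row.sums is a name chosen here for illustration):

```r
# Rebuild the example data (same values as df.1 in the question).
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

# tapply groups the row numbers by state; the function then sums the
# fruit columns over the rows belonging to each group.
row.sums <- tapply(seq_len(nrow(df.1)), df.1$state,
                   function(i) colSums(df.1[i, 3:5]))

row.sums[[1]]  # named column sums for the first group (AA)
row.sums[[2]]  # named column sums for the second group (BB)

# Stack the per-group vectors into one matrix.
do.call(rbind, row.sums)
```

Because each group yields three values rather than one, tapply returns a list-like array here instead of a simplified vector, which is why the elements are extracted with [[ ]].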
