Apply different functions to different sets of columns by group


Question

I have a data.table with the following features:

  • bycols: columns that divide the data into groups
  • nonvaryingcols: columns that are constant within each group (so taking the first item from each group and carrying it through is sufficient)
  • datacols: columns to be aggregated / summarized (e.g. summed within each group)

I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above inputs as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.

Example data, where the different sets of columns are stored in character vectors:

library(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
  g1 = rep( letters, 10 ),
  g2 = rep( c(LETTERS,LETTERS), each = 5 ),
  dat1 = runif( 260 ),
  dat2 = runif( 260 ),
  nv1 = rep( seq(130), 2),
  nv2 = rep( seq(130), 2) 
)
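Before relying on the "take the first item" shortcut, it may be worth confirming that the nonvarying columns really are constant within each group. The following is a minimal sketch of one way to check that on the example data; the names counts and all_constant are illustrative and not part of the original post.

```r
library(data.table)

# Example data from the question, repeated so this block runs on its own.
set.seed(1)
bycols <- c("g1", "g2")
nonvaryingcols <- c("nv1", "nv2")
test <- data.table(
  g1 = rep(letters, 10),
  g2 = rep(c(LETTERS, LETTERS), each = 5),
  dat1 = runif(260),
  dat2 = runif(260),
  nv1 = rep(seq(130), 2),
  nv2 = rep(seq(130), 2)
)

# Count the distinct values of each nonvarying column within each group.
counts <- test[, lapply(.SD, uniqueN), by = bycols, .SDcols = nonvaryingcols]

# TRUE only if every group holds exactly one distinct value in every such column.
all_constant <- all(as.matrix(counts[, nonvaryingcols, with = FALSE]) == 1L)
stopifnot(all_constant)
```

If the check fails, taking the first element per group would silently discard information, so the column probably belongs in datacols or bycols instead.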

The final data should look like this:

   g1 g2      dat1      dat2 nv1 nv2
1:  a  A 0.8403809 0.6713090   1   1
2:  b  A 0.4491883 0.4607716   2   2
3:  c  A 0.6083939 1.2031960   3   3
4:  d  A 1.5510033 1.2945761   4   4
5:  e  A 1.1302971 0.8573135   5   5
6:  f  B 1.4964821 0.5133297   6   6

I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.

Recommended answer

As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
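As a rough illustration of what "computing on the language" means here, the sketch below builds small calls from column names held in strings and then evaluates the combined expression against a plain list. It uses only base R; the variable names (col, sum_call, first_call, res) and the toy values are assumptions made for the example.

```r
# Build the call sum(dat1) from a column name held in a character string.
col <- "dat1"
sum_call <- call("sum", as.symbol(col))
print(sum_call)

# Build the call nv1[1] the same way; "[" is an ordinary function in R.
first_call <- call("[", as.symbol("nv1"), 1)
print(first_call)

# Combine the named pieces into list(dat1 = sum(dat1), nv1 = nv1[1]).
e <- as.call(c(as.symbol("list"), list(dat1 = sum_call, nv1 = first_call)))
print(e)

# Evaluate the expression against a list that supplies the two columns.
res <- eval(e, list(dat1 = c(1, 2, 3), nv1 = c(7, 7, 7)))
str(res)
```

The same unevaluated expression can be handed to eval() inside the j argument, where the column names resolve against the columns of each group.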

Here, for instance, given the particular values in your example, you'd like a call that looks like:

test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
       by=c("g1", "g2")]

so the expression you'd like evaluated in the j-slot is

list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])

Most of the following function is taken up with constructing just that expression:

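f <- function(dt, bycols, datacols, nvcols) {
    # One call per column: sum(col) for data columns, col[1] for nonvarying ones.
    # sapply() over a character vector names each call after its column.
    e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
           sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
    # Wrap the named calls in list(...) to form the j-slot expression.
    e <- as.call(c(as.symbol("list"), e))
    dt[, eval(e), by = bycols]
}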

f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
##      g1 g2      dat1      dat2 nv1 nv2
##   1:  a  A 0.8403809 0.6713090   1   1
##   2:  b  A 0.4491883 0.4607716   2   2
##   3:  c  A 0.6083939 1.2031960   3   3
##   4:  d  A 1.5510033 1.2945761   4   4
##   5:  e  A 1.1302971 0.8573135   5   5
##  ---                                  
## 126:  v  Z 0.5627018 0.4282380 126 126
## 127:  w  Z 0.7588966 1.4429034 127 127
## 128:  x  Z 0.7060596 1.3736510 128 128
## 129:  y  Z 0.6015249 0.4488285 129 129
## 130:  z  Z 1.5304034 1.6012207 130 130
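One way to gain confidence in the constructed expression is to compare the function's result with the hand-written call shown earlier. The sketch below repeats the example data and f so that it runs on its own; the names programmatic and by_hand are illustrative only.

```r
library(data.table)

# Example data from the question.
set.seed(1)
bycols <- c("g1", "g2")
datacols <- c("dat1", "dat2")
nonvaryingcols <- c("nv1", "nv2")
test <- data.table(
  g1 = rep(letters, 10),
  g2 = rep(c(LETTERS, LETTERS), each = 5),
  dat1 = runif(260),
  dat2 = runif(260),
  nv1 = rep(seq(130), 2),
  nv2 = rep(seq(130), 2)
)

# Same function as in the answer above.
f <- function(dt, bycols, datacols, nvcols) {
  e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
         sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
  e <- as.call(c(as.symbol("list"), e))
  dt[, eval(e), by = bycols]
}

# Result from the programmatically built j expression.
programmatic <- f(test, bycols = bycols, datacols = datacols,
                  nvcols = nonvaryingcols)

# Result from the call written out by hand.
by_hand <- test[, list(dat1 = sum(dat1), dat2 = sum(dat2),
                       nv1 = nv1[1], nv2 = nv2[1]),
                by = c("g1", "g2")]

# Both routes should agree on values, column names, and row order.
stopifnot(isTRUE(all.equal(programmatic, by_hand)))
```

Because only the character vectors change between uses, the same check can be rerun with other column sets without touching the body of f.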
