Collapse a data.table with a mix of different operations on each column


Question

I have a data.table with the following features:


  • bycols that divide the data into groups
  • nonvaryingcols that are constant within each group (so that taking the first item from within each group and carrying that through would be sufficient)
  • datacols that need some summary operation performed on them (e.g. sum them within group)

I'm curious what the most efficient way is to do what you might call a mixed collapse, taking all three of the above inputs as character vectors. It doesn't have to be the absolute fastest, but fast enough with reasonable syntax would be ideal.

Example data:

require(data.table)
set.seed(1)
bycols <- c("g1","g2")
datacols <- c("dat1","dat2")
nonvaryingcols <- c("nv1","nv2")
test <- data.table(
  g1 = rep( letters, 10 ),
  g2 = rep( c(LETTERS,LETTERS), each = 5 ),
  dat1 = runif( 260 ),
  dat2 = runif( 260 ),
  nv1 = rep( seq(130), 2),
  nv2 = rep( seq(130), 2) 
)

The final data should look like this:

   g1 g2      dat1      dat2 nv1 nv2
1:  a  A 0.8403809 0.6713090   1   1
2:  b  A 0.4491883 0.4607716   2   2
3:  c  A 0.6083939 1.2031960   3   3
4:  d  A 1.5510033 1.2945761   4   4
5:  e  A 1.1302971 0.8573135   5   5
6:  f  B 1.4964821 0.5133297   6   6



I have worked out two different ways of doing it, but one is horridly inflexible and unwieldy, and one is horridly slow. Will post tomorrow if no one has come up with something better by then.

Recommended answer

As always with this sort of programmatic use of [.data.table, the general strategy is to construct an expression e that can be evaluated in the j argument. Once you understand that (as I'm sure you do), it just becomes a game of computing on the language to get a j-slot expression that looks like what you'd write at the command line.
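As a minimal sketch of what "computing on the language" means here, the base R snippet below builds the unevaluated call sum(x) from a character string and then evaluates it. The names colname and env are hypothetical and exist only for this illustration; no data.table is involved yet.

```r
# Hypothetical column name, supplied as a character string.
colname <- "x"

# call() builds an unevaluated call; as.symbol() turns the string into a name,
# so e holds the expression sum(x) rather than its value.
e <- call("sum", as.symbol(colname))
print(e)

# Hypothetical data to evaluate against: eval() accepts a list as its
# second argument and looks up x there.
env <- list(x = c(1, 2, 3))
result <- eval(e, env)
print(result)
```

The same idea carries over to the j argument: the expression is assembled first and only evaluated once the grouped columns are in scope.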

Here, for instance, and given the particular values in your example, you'd like a call that looks like:

test[, list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1]),
       by=c("g1", "g2")]

so the expression you'd like evaluated in the j-slot is

list(dat1=sum(dat1), dat2=sum(dat2), nv1=nv1[1], nv2=nv2[1])
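To see what such an expression is made of, it can help to quote a shortened version and take it apart. This is only an illustrative sketch in base R; the variable name target is hypothetical.

```r
# quote() captures the expression without evaluating it.
target <- quote(list(dat1 = sum(dat1), nv1 = nv1[1]))

# A call behaves like a list: the first element is the function name,
# the remaining elements are its (named) arguments.
parts <- as.list(target)
print(parts[[1]])      # the symbol `list`
print(names(target))   # "" for the function slot, then the argument names
print(parts$dat1)      # the inner call sum(dat1)
print(parts$nv1)       # the inner call nv1[1]
```

So building the j-slot expression amounts to creating one small call per column, naming each after its column, and putting the symbol list in front.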

Most of the following function is taken up with constructing just that expression:

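f <- function(dt, bycols, datacols, nvcols) {
    # Build one call per column: sum(col) for data columns, col[1] for
    # non-varying columns. sapply() names each call after its column.
    e <- c(sapply(datacols, function(x) call("sum", as.symbol(x))),
           sapply(nvcols, function(x) call("[", as.symbol(x), 1)))
    # Wrap the named calls in list(...) to get the j-slot expression.
    e <- as.call(c(as.symbol("list"), e))
    dt[, eval(e), by=bycols]
}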

f(test, bycols=bycols, datacols=datacols, nvcols=nonvaryingcols)
##      g1 g2      dat1      dat2 nv1 nv2
##   1:  a  A 0.8403809 0.6713090   1   1
##   2:  b  A 0.4491883 0.4607716   2   2
##   3:  c  A 0.6083939 1.2031960   3   3
##   4:  d  A 1.5510033 1.2945761   4   4
##   5:  e  A 1.1302971 0.8573135   5   5
##  ---                                  
## 126:  v  Z 0.5627018 0.4282380 126 126
## 127:  w  Z 0.7588966 1.4429034 127 127
## 128:  x  Z 0.7060596 1.3736510 128 128
## 129:  y  Z 0.6015249 0.4488285 129 129
## 130:  z  Z 1.5304034 1.6012207 130 130
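As a sanity check, the sketch below rebuilds the example data and compares a programmatically built expression against the hand-written call. The helper build_j is hypothetical and is a variation on the function above rather than the code from the answer: it uses lapply() with setNames() in place of sapply(), on the assumption that naming the calls explicitly is easier to follow.

```r
library(data.table)

# Same example data as in the question.
set.seed(1)
test <- data.table(
  g1   = rep(letters, 10),
  g2   = rep(c(LETTERS, LETTERS), each = 5),
  dat1 = runif(260),
  dat2 = runif(260),
  nv1  = rep(seq(130), 2),
  nv2  = rep(seq(130), 2)
)

# Hypothetical helper: returns the unevaluated call
# list(dat1 = sum(dat1), ..., nv1 = nv1[1L], ...).
build_j <- function(datacols, nvcols) {
  sums   <- lapply(datacols, function(x) call("sum", as.symbol(x)))
  firsts <- lapply(nvcols,   function(x) call("[", as.symbol(x), 1L))
  calls  <- setNames(c(sums, firsts), c(datacols, nvcols))
  as.call(c(as.symbol("list"), calls))
}

e <- build_j(c("dat1", "dat2"), c("nv1", "nv2"))
print(e)

# Evaluate the built expression in j, grouped by the by-columns.
by_expr <- test[, eval(e), by = c("g1", "g2")]

# The call as it would be typed at the command line.
by_hand <- test[, list(dat1 = sum(dat1), dat2 = sum(dat2),
                       nv1 = nv1[1], nv2 = nv2[1]),
                by = c("g1", "g2")]

# Both routes should give the same table.
print(all.equal(by_expr, by_hand))
```

If the two results agree, the constructed expression is doing exactly what the literal call does, and the column vectors can be swapped for any other character vectors of the same kind.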
