Fast melted data.table operations

Question

I am looking for patterns for manipulating data.table objects whose structure resembles that of data frames created with melt from the reshape2 package. I am dealing with data tables with millions of rows, so performance is critical.

The generalized form of the question is whether there is a way to group on a subset of the values in a column and have the result of the grouping operation create one or more new columns.

A specific form of the question is how to use data.table to accomplish the equivalent of what dcast does in the following:

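library(data.table)
library(reshape2)  # dcast
library(plyr)      # provides .() used in subset=

input <- data.table(
  id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), 
  variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
# sum value per id and variable, keeping only the 'x' and 'y' rows
dcast(input, 
  id ~ variable, sum, 
  subset=.(variable %in% c('x', 'y')))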

The output is:

  id  x  y
1  1  1  5
2  2  4 11
3  3 15  9

Recommended answer

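Quick untested answer: it seems you are looking for by-without-by, a.k.a. grouping-by-i. (Since data.table 1.9.4 this per-row grouping is no longer implicit and has to be requested with by=.EACHI; the code in this answer is written for the older behaviour.)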

setkey(input,variable)
input[c("x","y"),sum(value)]
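For readers on a current data.table, here is a minimal sketch of the same idea with the grouping made explicit. It assumes data.table 1.9.4 or later, reuses the input table from the question, and the column name total is chosen only for illustration.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# j is evaluated once per row of i ("x", then "y").
# by = .EACHI requests that per-row grouping explicitly.
res <- input[c("x", "y"), .(total = sum(value)), by = .EACHI]
print(res)
```

Without by = .EACHI, a current data.table would sum over all matched rows at once and return a single value rather than one row per group.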

This is like a fast HAVING in SQL: j gets evaluated for each row of i. In other words, the above gives the same result as the following, but much faster:

input[,sum(value),keyby=variable][c("x","y")]

The latter subsets and evaluates j for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups.

The group results are returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be nearly instant. That's the thinking, anyway.
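The aggregate-then-reshape idea can be sketched roughly as follows. This assumes a data.table version that ships its own dcast method for data.tables (1.9.6 or later); the names agg, wide and total are illustrative only.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# Step 1: aggregate in long format, touching only the "x" and "y" rows.
agg <- input[c("x", "y"), .(total = sum(value)), by = .(variable, id)]

# Step 2: reshape the small aggregated table to wide format.
wide <- dcast(agg, id ~ variable, value.var = "total")
print(wide)
```

The reshape runs on at most one row per (variable, id) pair, so its cost depends on the number of groups rather than on the millions of input rows.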

The initial setkey(input,variable) might bite if input has a lot of columns that are not of interest, because setkey physically reorders every column. If so, it might be worth subsetting to the columns needed first:

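# list() selects columns in every data.table version; a character vector in j needed with=FALSE in older ones
DT = setkey(input[ , list(variable, value)], variable)
DT[c("x","y"),sum(value)]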

In future, when secondary keys are implemented, this would be easier (current data.table versions offer setindex() and the on= argument for this purpose):

set2key(input,variable)              # add a secondary key 
input[c("x","y"),sum(value),key=2]   # syntax speculative
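The speculative syntax above never shipped in that form. As a hedged sketch of what later versions allow instead, the following uses the on= argument (data.table 1.9.6 or later), which joins on a column without reordering the table; the setindex() call is optional and the column name total is illustrative.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Optional: build a secondary index on variable; the rows are not reordered.
setindex(input, variable)

# Join on variable via on=, evaluating j once per row of i.
res <- input[.(variable = c("x", "y")), .(total = sum(value)),
             on = "variable", by = .EACHI]
print(res)
```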

You can also group by id:

setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']

Including id in the key might be worth setkey's cost, depending on your data:

setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']

If you combine by-without-by with by, as above, the i part operates just like a subset; that is, j is run for each row of i only when by is missing (hence the name by-without-by). So you need to include variable again in by, as shown above.

Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, if I understand correctly):

input[c("x","y"),sum(value),by=id]
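To make the difference from the per-variable grouping concrete, here is a small sketch on the question's input table. The column name total is illustrative; the call behaves as a subset followed by grouping on id.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# i restricts the rows to variable "x" or "y"; by = id then sums
# across both variables together, giving one row per id.
union_sum <- input[c("x", "y"), .(total = sum(value)), by = id]
print(union_sum)
```

Only the row with variable "other" is excluded, so each id's total is the sum of its x and y values combined.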
