data.table grouping separately on numeric and text variables


Question

I'm trying to simplify a two-stage data.table process that acts on both numeric and character variables: for example, take the first element of textvar and sum each of the numeric variables within each group. Consider this small example:

library(data.table)
dt <- data.table(grpvar=letters[c(1,1,2)], textvar=c("one","two","one"),
                 numvar=1:3, othernum=2:4)
dt
#   grpvar textvar numvar othernum
#1:      a     one      1        2
#2:      a     two      2        3
#3:      b     one      3        4

My first thought was to nest .SD so that the one text variable is dropped from the lapply call, but that seemed a bit complicated:

dt[, c(textvar=textvar[1], .SD[, lapply(.SD, sum), .SDcols=-c("textvar")]), by=grpvar]
#   grpvar textvar numvar othernum
#1:      a     one      3        5
#2:      b     one      3        4

Then I thought maybe I could do each grouping separately and join the results, but that seems even worse:

dt[, .(textvar=textvar[1]), by=grpvar][ 
  dt[, lapply(.SD, sum), by=grpvar, .SDcols=-c("textvar")], on="grpvar" 
]
#   grpvar textvar numvar othernum
#1:      a     one      3        5
#2:      b     one      3        4

Is there a simpler construction that would get around the nesting of .SD or the joining? I feel like I'm overlooking something elementary.

Recommended answer

The j-argument in data.table is (deliberately) quite flexible. All we need to remember is that:


As long as j returns a list, each element of the list will become a column in the resulting data.table.

Using the fact that c(list, list) is a list, we can construct the expression as follows:

dt[, c(textvar = textvar[1L], lapply(.SD, sum)), # select/compute all cols necessary
      .SDcols = numvar:othernum,                 # provide .SD's columns 
      by = grpvar]                               # group by 'grpvar'
#    grpvar textvar numvar othernum
# 1:      a     one      3        5
# 2:      b     one      3        4

Here, I've not wrapped the first expression with list(), since textvar[1L] returns a length-1 vector; i.e., identical(c(1, list(2, 3)), c(list(1), list(2, 3))) is TRUE.
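The identity above can be checked in base R without data.table. The following is only a small sketch; the variable names mixed, wrapped, spread and named are made up for illustration. It also hints at why the length-1 condition matters: a longer vector passed to c() alongside a list contributes one list element per value.

```r
# Base R only; all variable names here are illustrative.

# A bare length-1 vector combined with a list gives the same result
# as wrapping that vector in list() first.
mixed   <- c(1, list(2, 3))
wrapped <- c(list(1), list(2, 3))
print(identical(mixed, wrapped))

# A longer vector is split into one list element per value,
# so this list has three elements rather than two.
spread <- c(c("one", "two"), list(3))
print(length(spread))

# Names carry through c(), which is what produces column names in j.
named <- c(textvar = "one", list(numvar = 3))
print(names(named))
```

If the first expression could ever return more than one value per group, wrapping it in list() keeps it as a single element.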

Note that the data.table expression above is only possible from v1.9.7. The bug was just recently fixed in the current development version.
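When the numeric columns are not adjacent, or not known in advance, one possible variation is to work out their names first and pass them to .SDcols as a character vector. This is a sketch rather than part of the original answer: num_cols and res are names introduced here, and the first expression is wrapped in list() explicitly.

```r
library(data.table)

# Same example data as in the question.
dt <- data.table(grpvar = letters[c(1, 1, 2)], textvar = c("one", "two", "one"),
                 numvar = 1:3, othernum = 2:4)

# Assumption for this sketch: every numeric column should be summed.
num_cols <- names(dt)[sapply(dt, is.numeric)]

# First textvar per group, plus the sum of each numeric column.
res <- dt[, c(list(textvar = textvar[1L]), lapply(.SD, sum)),
          by = grpvar, .SDcols = num_cols]
print(res)
```

Selecting by type avoids naming the text column explicitly, at the cost of also summing any numeric column that was not meant to be aggregated.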
