Using dynamic column names in `data.table`


Problem description

I want to calculate the mean of each of several columns in a data.table, grouped by another column. My question is similar to two other questions on SO (one and two), but I couldn't apply those to my problem.

Here is an example:

library(data.table)
dtb <- fread(input = "condition,var1,var2,var3
      one,100,1000,10000
      one,101,1001,10001
      one,102,1002,10002
      two,103,1003,10003
      two,104,1004,10004
      two,105,1005,10005
      three,106,1006,10006
      three,107,1007,10007
      three,108,1008,10008
      four,109,1009,10009
      four,110,1010,10010")

dtb
#    condition var1 var2  var3
# 1:       one  100 1000 10000
# 2:       one  101 1001 10001
# 3:       one  102 1002 10002
# 4:       two  103 1003 10003
# 5:       two  104 1004 10004
# 6:       two  105 1005 10005
# 7:     three  106 1006 10006
# 8:     three  107 1007 10007
# 9:     three  108 1008 10008
# 10:     four  109 1009 10009
# 11:     four  110 1010 10010

Calculating each single mean is easy, e.g. for "var1": dtb[, mean(var1), by = condition]. But this quickly becomes cumbersome if there are many variables and you need to write all of them out. Thus, dtb[, list(mean(var1), mean(var2), mean(var3)), by = condition] is undesirable. I need the column names to be dynamic, and I wish to end up with something like this:

   condition  var1   var2    var3
1:       one 101.0 1001.0 10001.0
2:       two 104.0 1004.0 10004.0
3:     three 107.0 1007.0 10007.0
4:      four 109.5 1009.5 10009.5


Recommended answer

You should use .SDcols, especially if you have many columns and need a particular operation performed on only a subset of them (apart from the grouping columns).

dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]

#    condition  var1   var2    var3
# 1:       one 101.0 1001.0 10001.0
# 2:       two 104.0 1004.0 10004.0
# 3:     three 107.0 1007.0 10007.0
# 4:      four 109.5 1009.5 10009.5

You could also first collect the names of all the columns you want the mean of in a variable, and then pass that variable to .SDcols like this:

keys <- setdiff(names(dtb), "condition")
# keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
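When the wanted columns share a naming scheme, the name vector can also be built from a pattern. The following is only a minimal sketch: the small table is a stand-in for the question's data, and the "^var" pattern is an assumption about how the columns are named.

```r
library(data.table)

# Small stand-in for the question's table (same layout, fewer rows).
dtb <- data.table(
  condition = c("one", "one", "two", "two"),
  var1 = c(100, 102, 103, 105),
  var2 = c(1000, 1002, 1003, 1005)
)

# Build the column list at run time instead of typing it out.
# The "^var" pattern is an assumption about the column naming scheme.
cols <- grep("^var", names(dtb), value = TRUE)

# Only the columns named in cols are handed to .SD.
res <- dtb[, lapply(.SD, mean), by = condition, .SDcols = cols]
print(res)
```

Because cols is an ordinary character vector, any base R way of producing it (setdiff, grep, a lookup from a config file) works the same way with .SDcols.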

Edit: As Matthew Dowle rightly pointed out, since you need the mean of every other column after grouping by condition, you could just do:

dtb[, lapply(.SD, mean), by=condition]

David's edit (which got rejected): Read more about .SD from this post. I find it relevant here. Thanks @David.

Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):

require(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace = TRUE))
m  <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")

and you want the mean of only some of the columns, say columns 251:300.


  • You can compute the mean of all the columns and then subset the result (which is not very efficient, as you compute on the whole data).

dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out) # 15 * 301, not efficient.


  • You can first filter the data.table down to just these columns and then compute the mean (again not necessarily the best solution, as you have to create an extra subsetted data.table every time you want to operate on certain columns).

    dt.sub <- dt[, c(1, 251:300), with=FALSE]
    setkey(dt.sub, "grp")
    dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
    


  • You can specify each of the columns one by one as you'd normally do (though this is only practical for smaller data.tables).

    # if you just need one or few columns
    dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]
    


  • .SDcols


    As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.

    This implicitly filters the columns that are passed to .SD, instead of creating a subset as we did before, and it is very efficient and fast.

    How can we do this?


    • By specifying the column numbers:

    dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
    dim(dt.out) # 15 * 51 (what we expect)
    


    • Or alternatively by specifying the column names:

    ids <- paste0("V", 251:300) # get column ids
    dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
    dim(dt.out) # 15 * 51 (what we expect)
    


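    .SDcols accepts both column names and column numbers. In both cases, .SD is provided with only the columns we've specified. Keep in mind that column numbers count every column of dt, including grp, so position 251 is the 250th of the 300 numeric columns; selecting by name avoids this off-by-one.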
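    To see how the two forms of .SDcols relate, here is a small sketch on a made-up four-column table (the data and the names by_pos and by_name are illustrative only, not part of the example above). It selects the same two columns once by position and once by name:

```r
library(data.table)

# Toy table: one grouping column followed by three numeric columns.
dt <- data.table(
  grp = rep(c("a", "b"), each = 2),
  V1  = c(1, 2, 3, 4),
  V2  = c(10, 20, 30, 40),
  V3  = c(100, 200, 300, 400)
)

# Positions count the grouping column too, so 3:4 means V2 and V3.
by_pos  <- dt[, lapply(.SD, mean), by = grp, .SDcols = 3:4]

# The same selection by name.
by_name <- dt[, lapply(.SD, mean), by = grp, .SDcols = paste0("V", 2:3)]

print(by_pos)
print(isTRUE(all.equal(by_pos, by_name)))
```

    Selecting by name tends to be the more robust choice when columns may later be added or reordered, since positions shift but names do not.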

    Hope this helps.
