Using dynamic column names in `data.table`


Question

I want to calculate the mean of each of several columns in a data.table, grouped by another column. My question is similar to two other questions on SO (one and two), but I couldn't apply those to my problem.

Here is an example:

library(data.table)
dtb <- fread(input = "condition,var1,var2,var3
      one,100,1000,10000
      one,101,1001,10001
      one,102,1002,10002
      two,103,1003,10003
      two,104,1004,10004
      two,105,1005,10005
      three,106,1006,10006
      three,107,1007,10007
      three,108,1008,10008
      four,109,1009,10009
      four,110,1010,10010")

dtb
#    condition var1 var2  var3
# 1:       one  100 1000 10000
# 2:       one  101 1001 10001
# 3:       one  102 1002 10002
# 4:       two  103 1003 10003
# 5:       two  104 1004 10004
# 6:       two  105 1005 10005
# 7:     three  106 1006 10006
# 8:     three  107 1007 10007
# 9:     three  108 1008 10008
# 10:     four  109 1009 10009
# 11:     four  110 1010 10010

The calculation of each single mean is easy; e.g. for "var1": dtb[ , mean(var1), by = condition]. But this quickly becomes cumbersome if there are many variables and you need to write out all of them. Thus, dtb[, list(mean(var1), mean(var2), mean(var3)), by = condition] is undesirable. I need the column names to be dynamic, and I wish to end up with something like this:

   condition  var1   var2    var3
1:       one 101.0 1001.0 10001.0
2:       two 104.0 1004.0 10004.0
3:     three 107.0 1007.0 10007.0
4:      four 109.5 1009.5 10009.5

Recommended answer

You should use .SDcols, especially if you have many columns and need a particular operation performed only on a subset of them (apart from the grouping variable columns).

dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]

#    condition  var1   var2    var3
# 1:       one 101.0 1001.0 10001.0
# 2:       two 104.0 1004.0 10004.0
# 3:     three 107.0 1007.0 10007.0
# 4:      four 109.5 1009.5 10009.5

You could also first collect the names of all the columns you want to average in a variable and then pass it to .SDcols like this:

keys <- setdiff(names(dtb), "condition")
# keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
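Because .SDcols and by both accept character vectors, the same idea can be wrapped in a function whose column names are only known at run time. The following is a minimal sketch, not part of the original answer: the helper `group_means` and the variable `value_cols` are hypothetical names, and the table is rebuilt inline so the block runs on its own.

```r
library(data.table)

# Rebuild the example table from the question.
dtb <- data.table(
  condition = rep(c("one", "two", "three", "four"), times = c(3, 3, 3, 2)),
  var1 = 100:110,
  var2 = 1000:1010,
  var3 = 10000:10010
)

# Hypothetical helper: all column names arrive as character vectors.
group_means <- function(dt, value_cols, group_col) {
  dt[, lapply(.SD, mean), by = group_col, .SDcols = value_cols]
}

# Build the column list from a naming pattern instead of typing it out.
value_cols <- grep("^var", names(dtb), value = TRUE)

res_all  <- group_means(dtb, value_cols, "condition")
res_some <- group_means(dtb, c("var1", "var3"), "condition")
print(res_all)
print(res_some)
```

Passing the names as arguments keeps the data.table call itself unchanged when the set of columns changes.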

As Matthew Dowle rightly pointed out, since you need the mean of all the other columns after grouping by condition, you could just do:

dtb[, lapply(.SD, mean), by=condition]
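This shortcut applies mean to every non-grouping column, so it relies on all of them being numeric. If the table also holds, say, a character column, one option is to select the numeric columns programmatically and pass them to .SDcols. A small sketch under that assumption; the table `dtb2`, its `label` column, and `num_cols` are made up for illustration:

```r
library(data.table)

# Hypothetical table with one non-numeric column besides the grouping column.
dtb2 <- data.table(
  condition = c("one", "one", "two"),
  label     = c("a", "b", "c"),
  var1      = c(100, 102, 104),
  var2      = c(1, 2, 3)
)

# Keep only the columns for which is.numeric() is TRUE.
num_cols <- names(dtb2)[vapply(dtb2, is.numeric, logical(1))]

res <- dtb2[, lapply(.SD, mean), by = condition, .SDcols = num_cols]
print(res)
```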

David's edit (which was rejected): Read more about .SD in this post. I find it relevant here. Thanks @David.

Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):

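library(data.table)
set.seed(45)
# one grouping column with 15 possible values
dt <- data.table(grp = sample(letters[1:15], 1000, replace = TRUE))
# 300 numeric columns, bound to the grouping column
m  <- matrix(rnorm(300 * 1000), ncol = 300)
dt <- cbind(dt, m)
setkey(dt, "grp")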

and you want to find the mean of only some of the columns, say 251:300. You have several options:

  • You can compute the mean of all the columns and then subset the result (which is not very efficient, as you compute on the whole data).

dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out) # 15 * 301, not efficient.

  • You can first filter the data.table down to just these columns and then compute the mean (which is again not necessarily the best solution, as you have to create an extra subsetted data.table every time you want to operate on certain columns).

    dt.sub <- dt[, c(1, 251:300)]
    setkey(dt.sub, "grp")
    dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
    

  • You can specify each of the columns one by one as you normally would (but this is only desirable for smaller data.tables).

    # if you just need one or few columns
    dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]
    

  • So what's the best solution? The answer is .SDcols.

    As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.

    This implicitly filters the columns that are passed to .SD instead of creating a subset (as we did before), and it is very efficient and fast.

    How can we do this?

    • By specifying the column numbers:

    dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
    dim(dt.out) # 15 * 51 (what we expect)
    

  • Or alternatively by specifying the column names (ids):

    ids <- paste0("V", 251:300) # get column ids
    dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
    dim(dt.out) # 15 * 51 (what we expect)
    

  • It accepts both column names and column numbers as arguments. In both cases, .SD is provided only with the columns we've specified.
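One detail worth checking when switching between the two forms: positions count the grouping column, names do not. With grp in position 1, positions 251:300 and the names V251 to V300 overlap but are not the same set. The sketch below illustrates this; it sets the matrix column names explicitly to V1 to V300 (an assumption made here so the names are predictable), and `by_pos` and `by_name` are names introduced for illustration.

```r
library(data.table)

set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace = TRUE))
m  <- matrix(rnorm(300 * 1000), ncol = 300)
# Assumption: name the columns explicitly so V1..V300 is guaranteed.
colnames(m) <- paste0("V", 1:300)
dt <- cbind(dt, m)

by_pos  <- dt[, lapply(.SD, mean), by = grp, .SDcols = 251:300]
by_name <- dt[, lapply(.SD, mean), by = grp, .SDcols = paste0("V", 251:300)]

# Columns selected by position but not by name, and the reverse.
only_in_pos  <- setdiff(names(by_pos), names(by_name))
only_in_name <- setdiff(names(by_name), names(by_pos))
print(only_in_pos)
print(only_in_name)
```

Selecting by name is usually the safer choice when the column order might change.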

    Hope this helps.
