Row operations in data.table using `by = .I`


Problem description


Here is a good SO explanation about row operations in data.table

One alternative that came to my mind is to use a unique id for each row and then apply a function using the by argument. Like this:

library(data.table)

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)],
                 V1=1:5,
                 V2=3:7,
                 V3=5:1)

# create a column with row positions
dt[, rowpos := .I]

# calculate standard deviation by row
dt[ ,  sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ] 

Questions:

  1. Is there a good reason not to use this approach? Perhaps there are other, more efficient alternatives?

  2. Why doesn't using by = .I work the same way?

    dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]

Solution

Note: section (3) of this answer was updated in April 2019, because many changes to data.table over time had rendered the original version obsolete. Also, use of the argument with= was removed from all instances of data.table code, as it has since been deprecated.

1) Well, one reason not to use it, at least for the rowSums example, is performance, plus the creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column. (Note that the original question used rowSums as the example function, to which this part of the answer responds. The OP afterwards edited the question to use a different function, for which part 3 of this answer is more relevant.)

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt){
  dt[, rowpos := .I] 
  dt[ ,  sdd := rowSums(.SD[, 2:4]), by = rowpos ] }
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols= 2:4]

library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
#   expr      min       lq     mean   median       uq      max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a 

2) On your second question, although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that according to ?data.table ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so is not available beforehand to be passed in as a parameter value as in by=.I.
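A minimal sketch of the working alternative, applied to the sd calculation from the original question (column indices assume the example data above):

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# by = seq_len(nrow(dt)) (equivalently by = 1:NROW(dt)) makes one group
# per row, so j is evaluated once for each row; .SDcols restricts .SD to
# the numeric columns V1:V3
dt[, sdd := sd(unlist(.SD)), .SDcols = 2:4, by = seq_len(nrow(dt))]

dt$sdd[1]  # sd of c(1, 3, 5), i.e. 2
```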

It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)

.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL

The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
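To see that equivalence concretely, here is a sketch written with an explicit by = NULL (the effective meaning of by = .I in the data.table versions this answer describes; note that by = .I itself may behave differently in later releases):

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# by = NULL means a single group containing every row, so sum(.SD[, 2:4])
# is the grand total of V1, V2 and V3 (15 + 25 + 15 = 55), recycled down
# the whole column -- the same unwanted result that by = .I produced
dt[, sdd := sum(.SD[, 2:4]), by = NULL]

dt$sdd  # 55 in every row
```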

3) We have already seen in part 1 that, in the case of rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is a negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column.

dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)

f.rowpos <- function() {
  dt[, rowpos := .I] 
  dt[,  sdd := sum(.SD[, 2:4]), by = rowpos ] 
}

f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt)) ]
}

f.forset <- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}

microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval
# f.rowpos()  559.1115  575.3162  580.2853  578.6865  588.5532  599.7591     5
#   f.nrow()  558.4327  582.4434  584.6893  587.1732  588.6689  606.7282     5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486     5

So, in conclusion, even in situations where there is not an optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.
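As a footnote to that conclusion, when per-row speed matters one can also sidestep by-grouping altogether by treating the numeric columns as a matrix and applying the function over rows. This is a generic alternative not taken from the original answer, sketched here for the example data:

```r
library(data.table)

dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)

# Row-wise sd without any grouping: convert the .SDcols columns to a
# matrix and apply sd over rows (MARGIN = 1)
dt[, sdd := apply(as.matrix(.SD), 1, sd), .SDcols = 2:4]

dt$sdd[1]  # sd of c(1, 3, 5), i.e. 2
```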
