Should I use mget(), .. or with=FALSE to select columns of a data.table?

Question

There are multiple ways to select columns of a data.table using a variable that holds the desired column names (with=FALSE, .., mget, ...).

Is there a consensus on which one to use, and when? Is one more data.table-y than the others?

I can come up with the following arguments:

  1. with=FALSE and .. are almost equally fast, while mget is slower
  2. .. can't select concatenated column names "on the fly" (EDIT: the current CRAN version, 1.12.8, definitely can; I was using an old version that could not, so this argument is flawed)
  3. mget() is close to the useful syntax of get(), which seems to be the only way to use a variable holding a column name in a calculation in j

Regarding (1):

library(data.table)
library(microbenchmark)

a <- mtcars
setDT(a)

selected_cols <- names(a)[1:4]

microbenchmark(a[, mget(selected_cols)],
               a[, selected_cols, with = FALSE],
               a[, ..selected_cols],
               a[, .SD, .SDcols = selected_cols])

#Unit: microseconds
#                             expr     min       lq     mean   median       uq      max neval cld
#          a[, mget(selected_cols)] 468.483 495.6455 564.2953 504.0035 515.4980 4341.768   100   c
#  a[, selected_cols, with = FALSE] 106.254 118.9385 141.0916 124.6670 130.1820  966.151   100 a  
#              a[, ..selected_cols] 112.532 123.1285 221.6683 129.9050 136.6115 2137.900   100 a  
# a[, .SD, .SDcols = selected_cols] 277.536 287.6915 402.2265 293.1465 301.3990 5231.872   100  b 
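
Before comparing timings, it may be worth confirming that the four calls really return the same thing. The following is a minimal sketch of such a check, assuming a reasonably recent data.table; the `res_*` names are only illustrative, and `as.data.table()` is used here instead of `setDT()` so that the example works on its own copy of mtcars.

```r
library(data.table)

# Work on a data.table copy of the built-in mtcars data set
a <- as.data.table(mtcars)
selected_cols <- names(a)[1:4]

# The four selection styles from the benchmark
res_mget   <- a[, mget(selected_cols)]
res_with   <- a[, selected_cols, with = FALSE]
res_dotdot <- a[, ..selected_cols]
res_sd     <- a[, .SD, .SDcols = selected_cols]

# Each result should be a data.table holding exactly the selected columns
stopifnot(
  isTRUE(all.equal(res_mget, res_with)),
  isTRUE(all.equal(res_dotdot, res_with)),
  isTRUE(all.equal(res_sd, res_with))
)
```

If the check passes, the choice between the four is a matter of speed and readability rather than of results.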

Regarding (2):

b <- data.table(x = rnorm(1e6), 
                y = rnorm(1e6, mean = 2, sd = 4), 
                z = sample(LETTERS, 1e6, replace = TRUE))

selected_col <- "y"

microbenchmark(b[, mget(c("x", selected_col))],
               b[, c("x", selected_col), with = FALSE],
               b[, c("x", ..selected_col)])
# Unit: milliseconds
#                                    expr      min       lq      mean   median       uq      max neval cld
#         b[, mget(c("x", selected_col))] 5.454126 7.160000 21.752385 7.771202 9.301334 147.2055   100   b
# b[, c("x", selected_col), with = FALSE] 2.520474 2.652773  7.764255 2.944302 4.430173 100.3247   100  a 
#             b[, c("x", ..selected_col)] 2.544475 2.724270 14.973681 4.038983 4.634615 218.6010   100  ab

Regarding (3):

b[, sqrt(get(selected_col))][1:5]
# [1] NaN 1.3553462 0.7544402 1.5791845 1.1007728

b[, sqrt(..selected_col)]
# error

b[, sqrt(selected_col), with = FALSE]
# error
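
As far as I can tell, both failing calls break for the same reason: what reaches sqrt() is at best the character string "y", not the column it names. Besides get(), the column can also be pulled out by name with `[[`, either from .SD or from the data.table itself. The sketch below compares the three on a small table; the `via_*` names are illustrative only, and abs() is added (it is not in the code above) purely to avoid NaN warnings from negative values.

```r
library(data.table)

set.seed(1)
b <- data.table(x = rnorm(10), y = rnorm(10, mean = 2, sd = 4))
selected_col <- "y"

# get() looks the column up by name inside j
via_get <- b[, sqrt(abs(get(selected_col)))]

# .SD is a list of columns, so [[ ]] with a character name also works in j
via_sd <- b[, sqrt(abs(.SD[[selected_col]])), .SDcols = selected_col]

# Outside of j, the same extraction works on the data.table directly
via_dbl <- sqrt(abs(b[[selected_col]]))

stopifnot(identical(via_get, via_sd), identical(via_get, via_dbl))
```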

EDIT: added .SDcols to the benchmark in (1), and b[, c("x", ..selected_col)] to (2).

Recommended answer

Should I use mget(), .. or with=FALSE to select columns of a data.table?

Use whichever you prefer, as long as it is not deprecated, of course. I don't see any realistic use case in which the performance differences between the presented solutions would matter. There are some arguments for using with=FALSE over the other interfaces, but they relate to the maintenance of those interfaces rather than to how users work with them.

In recent data.table versions, starting from 1.14.1, there is a new feature that enables deep parameterization of data.table queries. This new interface, let's call it the "env arg", can be used to solve the problem in your question. Yes, yet another way to solve it. The env arg interface is much more generic, so in such a simple use case I would still use with=FALSE. Below I added verbose=TRUE to the env arg calls so readers can see how the queries are pre-processed to substitute the variables.

b = data.table(x = 1L, y = 2, z = "c")
selected_col = "y"

b[, c("x", selected_col), with=FALSE]
#       x     y
#   <int> <num>
#1:     1     2

b[, .cols, env=list(.cols=I(c("x",selected_col))), verbose=TRUE]
#Argument 'j'  after substitute: c("x", "y")
#       x     y
#   <int> <num>
#1:     1     2

b[, .cols, env=list(.cols=as.list(c("x",selected_col))), verbose=TRUE]
#Argument 'j'  after substitute: list(x, y)
#       x     y
#   <int> <num>
#1:     1     2

The new env interface also nicely supports (3):

b[, sqrt(.col), env=list(.col=selected_col), verbose=TRUE]
#Argument 'j'  after substitute: sqrt(y)
#[1] 1.414214
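
For readers whose data.table version does not have the env argument yet, the same kind of substitution can be sketched with base R. To be clear, this is not the env interface itself, only an illustration of the idea under that assumption: bquote() builds the expression sqrt(y) from the character variable, and data.table evaluates it in j via eval(). The name `j_expr` is illustrative.

```r
library(data.table)

b <- data.table(x = 1L, y = 2, z = "c")
selected_col <- "y"

# Turn the string "y" into the symbol y and splice it into the call
j_expr <- bquote(sqrt(.(as.name(selected_col))))
print(j_expr)

# data.table evaluates a quoted expression passed through eval() in j
res <- b[, eval(j_expr)]
print(res)
```

Printing `j_expr` plays a similar role to the "after substitute" line in the verbose output above: it shows the query that is actually evaluated.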
