How to use data.table within functions and loops?
Question
When assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? Instead of hard-coding dataset variable names (the "cut" and "price" variables of the "diamonds" dataset), the code becomes dataset-agnostic: cut-n-paste ready for use inside any function or loop, when we don't know the column names in advance.
This is the original code:
library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]
This is its dataset-agnostic equivalent:
dt <- data.table(ggplot2::diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"
strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal"
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work!
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work!
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
Note (thanks to the reply below): if you need to change a variable's value by reference, you need to use (), not get(), as shown below:
strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
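As a minimal sketch of the () idiom inside a function, assuming a small made-up table rather than diamonds (the helper name add_group_mean and the columns grp and val are hypothetical, not from the post):

```r
library(data.table)

# Hypothetical helper: adds a column holding the per-group mean of `value_col`.
# `(new_col) :=` tells data.table to use the string stored in new_col as the
# column name; without the parentheses the column would be named "new_col".
add_group_mean <- function(dt, group_col, value_col, new_col) {
  dt[, (new_col) := mean(get(value_col)), by = group_col]
  invisible(dt)
}

# Made-up data for illustration only
toy <- data.table(grp = c("a", "a", "b", "b"), val = c(1, 3, 10, 20))
add_group_mean(toy, "grp", "val", "val_mean")
print(toy)
```

Because := works by reference, toy itself gains the new column; the function does not need to return a copy.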
Now you can cut-n-paste the following code for all your looping needs:
for(nVarGroup in 2:4) { # Grouped by several categorical values...
  for(nVarMeans in 5:10) { # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction <- quote(mean(get(strVarMeans)))
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]
    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }
}
My first question:
The code above, while meeting the required function/looping needs, appears quite convoluted. It uses several different (possibly inconsistent) non-intuitive tricks, such as a combination of (), get(), quote()/eval(), and [[]]. That seems like too many for such a straightforward need...
Is there a better way of accessing and modifying data.table values in loops? Perhaps with on=, lapply/.SD/.SDcols?
Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette for using data.table within functions and loops.
My second question:
Is dplyr easier for this purpose? For this question, however, I've set up a separate post: Is dplyr easier than data.table to be used within functions and loops?
Recommended answer
This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:
for(nVarGroup in 2:4) { # Grouped by several categorical values...
  for(nVarMeans in 5:10) { # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup) # note: re-sorts dt by reference
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup,
            .SDcols = strVarMeans]
    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}
I've left the old code in as comments for reference.
qAction is replaced by lapply(.SD, mean) together with the .SDcols parameter.
qGroup, which subsets the rows, is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
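Since .SDcols accepts a character vector, the inner loop over value columns could also be collapsed into a single call. The following is only a sketch under that assumption, on made-up data; the helper name group_means and the columns grp, v1 and v2 are hypothetical:

```r
library(data.table)

# Hypothetical helper: per-group means of several columns in one call.
# All column names arrive as character vectors, so no get()/quote()/eval()
# is needed; .SDcols selects the columns that make up .SD.
group_means <- function(dt, group_col, value_cols) {
  dt[, c(lapply(.SD, mean), list(COUNT = .N)),
     by = group_col, .SDcols = value_cols]
}

# Made-up data for illustration only
toy <- data.table(
  grp = c("a", "a", "b", "b"),
  v1  = c(1, 3, 10, 20),
  v2  = c(2, 4, 6, 8)
)
res <- group_means(toy, "grp", c("v1", "v2"))
print(res)
```

The result has one row per group and one mean column per entry of value_cols, which avoids printing a separate table for every value column.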
For a more complex subsetting expression, I would try non-equi (or conditional) joins using the on= syntax.
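A rough sketch of what such a non-equi join might look like when the column name is held in a variable. It assumes data.table 1.9.8 or later (where non-equi joins were introduced) and uses made-up data; the names toy, bounds, lo and hi are hypothetical:

```r
library(data.table)

# Made-up data for illustration only
toy <- data.table(id = 1:6, val = c(1, 2, 3, 4, 5, 6))

# Hypothetical bounds table: keep rows with lo <= val <= hi
bounds <- data.table(lo = 3, hi = 5)

value_col <- "val"  # column name held in a variable

# on= also accepts character strings such as "val>=lo",
# so the join condition can be assembled with sprintf()
join_cond <- c(sprintf("%s>=lo", value_col), sprintf("%s<=hi", value_col))

# nomatch = 0L drops bounds rows that match nothing
res <- toy[bounds, on = join_cond, nomatch = 0L]
print(res$id)
```

Note that in the result of a non-equi join the joined columns carry the values from the bounds table, so it is safer to identify the matching rows through an untouched column such as id.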
Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". Matt suggested creating a helper function
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution", as suggested by G. Grothendieck.
With this, the code for the loop finally becomes:
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)
for(nVarGroup in 2:4) { # Grouped by several categorical values...
  for(nVarMeans in 5:10) { # ... get means of all numerical parameters
    strGroupConditions = levels(dt[[nVarGroup]])[-1]
    strVarGroup = names(dt)[nVarGroup]
    strVarMeans = names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )
    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}
Now the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of the variables are referenced.
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.
With str_interp(), the central statement in the for loop would become
p <- EVAL(stringr::str_interp(
"dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
))
and the call to library(gsubfn) can be removed.
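For comparison, the same single expression can be built without going through a string, using base R's substitute() and as.name() instead of parse(text = ...). This is a swapped-in technique, not something proposed in the answer above, and the helper name filtered_group_mean and the toy columns are hypothetical:

```r
library(data.table)

# Hypothetical helper: builds the data.table call as a language object.
# G and V are replaced by column symbols, BY by the column name string;
# dt and keep stay as ordinary variables looked up in this function.
filtered_group_mean <- function(dt, group_col, value_col, keep) {
  expr <- substitute(
    dt[G %in% keep, list(AVE = mean(V), COUNT = .N), by = BY],
    list(G = as.name(group_col), V = as.name(value_col), BY = group_col)
  )
  eval(expr)
}

# Made-up data for illustration only
toy <- data.table(
  grp = c("a", "a", "b", "b", "c"),
  val = c(1, 3, 10, 20, 100)
)
res <- filtered_group_mean(toy, "grp", "val", keep = c("a", "b"))
print(res)
```

The design trade-off is readability against safety: the string version reads like a native statement, while the substitute() version never parses text, so unusual column names cannot change the meaning of the expression.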