How to use data.table within functions and loops?
Question
While assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: "data.table vs dplyr: can one do something well the other can't or does poorly?" so that, instead of hard-coding the dataset's variable names (the "cut" and "price" variables of the "diamonds" dataset), it becomes dataset-agnostic: cut-n-paste ready for use inside any function or loop (when we don't know the column names in advance).
This is the original code:
library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]
This is its dataset-agnostic equivalent:
dt <- data.table(ggplot2::diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"
strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal"
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work!
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work!
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
Note (thanks to the reply below): if you need to change a variable's value by reference, you need to wrap the column name in (), not get(), as shown below:
strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
Now you can cut-n-paste the following code for all your looping needs:
for (nVarGroup in 2:4) {      # Grouped by several categorical values...
  for (nVarMeans in 5:10) {   # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction <- quote(mean(get(strVarMeans)))
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }
}
My first question:
The code above, while meeting the required functional/looping needs, appears quite convoluted. It combines several different (and possibly inconsistent) non-intuitive tricks: (), get(), quote()/eval(), and [[]]. That seems like too many for such a straightforward need...
Is there another, better way of accessing and modifying data.table values in loops? Perhaps using on=, lapply/.SD/.SDcols?
Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as the one listed here: "How can one work fully generically in data.table in R with column names in variables"). Eventually, it would be great to create a dedicated vignette on using data.table within functions and loops.
The second question:
Is dplyr easier for this purpose? For this question, however, I've set up a separate post: "Is dplyr easier than data.table to be used within functions and loops?"
Recommended answer
This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:
for (nVarGroup in 2:4) {      # Grouped by several categorical values...
  for (nVarMeans in 5:10) {   # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)  # sorts dt by the grouping column, by reference
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup,
            .SDcols = strVarMeans]
    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}
I've left the old code as comments for reference.
qAction is replaced by lapply(.SD, mean) together with the .SDcols parameter.
qGroup for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
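The two replacements described above can also be wrapped in a function. The following is a minimal sketch, not part of the original answer: group_mean(), its arguments and the toy table are made up for illustration. It assumes an ad hoc join via on= instead of setkeyv(), so the table is not reordered, and it uses mean(.SD[[1L]]) so that AVE should come out as a plain numeric column.

```r
library(data.table)

# Hypothetical helper: mean and count of one column per group,
# restricted to the groups listed in `keep`.
# All column names are passed as strings.
group_mean <- function(dt, group_col, value_col, keep) {
  dt[.(keep), on = group_col, nomatch = 0L,   # ad hoc join replaces setkeyv() + i
     .(AVE = mean(.SD[[1L]]), COUNT = .N),    # .SD holds only `value_col`
     by = group_col, .SDcols = value_col]
}

# Toy data, made up for this example.
toy <- data.table(
  grp = c("a", "a", "b", "b", "c"),
  val = c(1, 3, 10, 20, 100)
)

res <- group_mean(toy, "grp", "val", keep = c("a", "b"))
print(res)
```

One caveat of this sketch: the argument names (group_col, value_col, keep) must not coincide with column names of the table, because data.table looks up names among the columns first.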
For a more complex subsetting expression, I would try to use non-equi (or conditional) joins using the on= syntax.
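As a rough sketch of the non-equi idea (an illustration under assumptions, not code from the original answer): the toy table, the bounds table and the names lo, hi and cond are hypothetical. The join conditions are built as strings from the column name, and which = TRUE is used to obtain the matching row numbers, which sidesteps the question of how the joined columns are named in the result.

```r
library(data.table)

# Toy data, made up for this example.
toy <- data.table(
  grp = c("a", "a", "b", "b", "c"),
  val = c(1, 3, 10, 20, 100)
)

value_col <- "val"                      # column name held in a variable
bounds <- data.table(lo = 2, hi = 50)   # hypothetical range to keep

# Join conditions as strings: "val>=lo" and "val<=hi".
cond <- c(paste0(value_col, ">=lo"), paste0(value_col, "<=hi"))

# Row numbers of `toy` whose value lies within [lo, hi].
idx <- toy[bounds, on = cond, which = TRUE, nomatch = 0L]

# Aggregate the selected rows; .SD holds only `value_col`.
res <- toy[sort(idx), .(AVE = mean(.SD[[1L]]), COUNT = .N),
           by = "grp", .SDcols = value_col]
print(res)
```

For a simple range like this one, an ordinary logical subset would do as well; the join form becomes more interesting when bounds has several rows or differs per group.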
Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
Matt suggested creating a helper function
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution", as suggested by G. Grothendieck.
With this, the code for the loop eventually becomes:
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)
for (nVarGroup in 2:4) {      # Grouped by several categorical values...
  for (nVarMeans in 5:10) {   # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1]
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]")
    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}
Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of the variables are referenced.
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.
With str_interp(), the central statement in the for loop would become
p <- EVAL(stringr::str_interp(
"dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
))
and the call to library(gsubfn) could be removed.
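The "one expression to be evaluated" idea can also be sketched without parsing text, by building the call from symbols with base R's bquote(). This is an illustration under assumptions rather than the approach of the original answer: group_mean_expr() and the toy table are hypothetical names. Note that inside bquote() the form .() means "substitute the value here", so list() is written in place of data.table's .() alias.

```r
library(data.table)

# Hypothetical helper: builds one data.table call from column names
# given as strings, then evaluates it.
group_mean_expr <- function(dt, group_col, value_col, keep) {
  grp <- as.name(group_col)   # turn the strings into symbols
  val <- as.name(value_col)
  # bquote() replaces .(grp) and .(val) with the symbols above;
  # list() stands in for data.table's .() to avoid the clash.
  expr <- bquote(
    dt[.(grp) %in% keep,
       list(AVE = mean(.(val)), COUNT = .N),
       by = .(grp)]
  )
  eval(expr)                  # evaluated in this function's frame
}

# Toy data, made up for this example.
toy <- data.table(
  grp = c("a", "a", "b", "b", "c"),
  val = c(1, 3, 10, 20, 100)
)

res <- group_mean_expr(toy, "grp", "val", keep = c("a", "b"))
print(res)
```

The constructed expression can be printed before it is evaluated, which makes it easy to check that it reads like the hand-written statement. As with any such helper, the argument names dt and keep should not coincide with column names of the table.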