How to use data.table within functions and loops?

Question

While assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? Instead of hard-coded variable names (the "cut" and "price" variables of the "diamonds" dataset), it is now dataset-agnostic: cut-n-paste ready for use inside any function or loop (when we don't know the column names in advance).

This is the original code:

library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]  

This is its dataset-agnostic equivalent:

dt <- data.table(ggplot2::diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"

strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal" 
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work! 
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work! 
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]

Note (thanks to the reply below): if you need to change a variable's values by reference, wrap the variable holding the column name in parentheses on the left-hand side of :=, i.e. (strVarToBeReplaced), rather than using get(), as shown below:

strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
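
The effect of the parentheses can be seen on a toy table. This is only a minimal sketch: the table toy and the variables target and source_col are made up for illustration and do not appear in the original post.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
toy <- data.table(g = c("a", "a", "b"), v = c(1, 3, 10), w = c(0, 0, 0))

target     <- "w"  # name of the column to overwrite, held in a variable
source_col <- "v"  # name of the column to average, held in a variable

# (target) makes data.table use the *value* of `target` as the column name;
# a bare `target := ...` would instead create a column literally named "target".
toy[, (target) := mean(get(source_col)), by = "g"]

print(toy)
```

Column w is overwritten in place with the per-group mean of v, and no new column is added.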

Now you can cut-n-paste the following code for all your looping needs:

for(nVarGroup in 2:4)       # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction  <- quote(mean(get(strVarMeans))) 
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions) 
    p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]

    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }

My first question:
The code above, while meeting the required functional/looping needs, appears quite convoluted. It combines several different (possibly inconsistent) and non-intuitive tricks: (), get(), quote()/eval(), and [[ ]]. That seems like too many for such a straightforward need...

Is there another, better way of accessing and modifying data.table values in loops? Perhaps with the use of on=, lapply/.SD/.SDcols?

Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette for using data.table within functions and loops.

The second question:
Is dplyr easier for this purpose? For this question, however, I've set up a separate post: Is dplyr easier than data.table to be used within functions and loops?

Recommended answer

This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup, 
            .SDcols = strVarMeans]

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

I've left the old code as comments for reference.

qAction is replaced by lapply(.SD, mean) together with the .SDcols parameter.

qGroup, which subset the rows, is replaced by setting a key and passing the vector of desired values as the i argument.
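
Since the question is also about functions, the same .SD/.SDcols idea can be wrapped in one. The following is a minimal sketch rather than part of the original answer: the helper name group_mean, its arguments, and the toy table are hypothetical, and the row filter uses a plain logical vector instead of a key.

```r
library(data.table)

# Hypothetical helper (not from the original post): mean and row count of one
# column per group, keeping only rows whose group value is in `keep_levels`.
group_mean <- function(dt, group_col, value_col, keep_levels) {
  rows <- dt[[group_col]] %in% keep_levels   # plain logical row filter
  dt[rows,
     .(AVE = mean(.SD[[1L]]), COUNT = .N),   # .SD holds only `value_col`
     by = group_col,
     .SDcols = value_col]
}

# Toy data (hypothetical, for illustration only)
toy <- data.table(g = c("a", "a", "b", "b", "c"), v = c(1, 3, 10, 20, 100))
res <- group_mean(toy, "g", "v", keep_levels = c("a", "b"))
print(res)
```

All column names arrive as character strings, so no quote(), eval() or get() is needed inside the function.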

For a more complex subsetting expression, I would try to use non-equi (or conditional) joins with the on= syntax.
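
As a minimal sketch of the on= syntax, here is the simpler equality case only (a non-equi join would put a comparison operator into on=). The toy table and the variable names group_col, value_col and keep are assumptions made for illustration.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
toy <- data.table(g = c("a", "a", "b", "c"), v = c(1, 3, 10, 100))
group_col <- "g"
value_col <- "v"
keep <- c("a", "b")

# Ad hoc join on the column named in `group_col`. No setkeyv() call is
# needed, so the row order of `toy` stays untouched.
subset_rows <- toy[.(keep), on = group_col, nomatch = NULL]

# Aggregate the matched rows, again with the column names held in variables
res <- subset_rows[, .(AVE = mean(.SD[[1L]]), COUNT = .N),
                   by = group_col, .SDcols = value_col]
print(res)
```

Compared with setkeyv(), this avoids reordering the table by reference on every pass through the loop.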

Or follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
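
The idea of building one expression can be sketched with base R alone. This is an illustration under assumptions, not Matt's helper: it uses sprintf() to build the text, and the toy table and variable names are made up.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
toy <- data.table(g = c("a", "a", "b"), v = c(1, 3, 10))
group_col <- "g"
value_col <- "v"

# Paste the column names into the text of an ordinary data.table call ...
expr_text <- sprintf("toy[, .(AVE = mean(%s), COUNT = .N), by = %s]",
                     value_col, group_col)
print(expr_text)

# ... then parse and evaluate that text in the current environment
res <- eval(parse(text = expr_text))
print(res)
```

Because the text is evaluated as code, the column names should come from a trusted source such as names(dt).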

Matt suggested creating a helper function

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution", as suggested by G. Grothendieck.

With this, the code for the loop eventually becomes:

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions = levels(dt[[nVarGroup]])[-1] 
    strVarGroup = names(dt)[nVarGroup]
    strVarMeans = names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of variables are referenced.

With version 1.1.0 (CRAN release on 2016-08-19), the stringr package gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.

With str_interp(), the central statement in the for loop would become

p <- EVAL(stringr::str_interp(
  "dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
  ))

and the call to library(gsubfn) could be removed.
