How to use data.table within functions and loops?


Question

When assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in the post "data.table vs dplyr: can one do something well the other can't or does poorly?" so that, instead of hard-coding dataset variable names (the "cut" and "price" variables of the "diamonds" dataset), it is dataset-agnostic: cut-and-paste ready for use inside any function or loop, where the column names are not known in advance.

This is the original code:

library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]  

This is its dataset-agnostic equivalent:

dt <- data.table(ggplot2::diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"

strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal" 
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work! 
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work! 
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]
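Since the goal is use inside functions, the same get() idea can be wrapped in a function. The following is only a minimal sketch, not the original poster's code: the toy table, its column names, and the helper name group_mean are all hypothetical stand-ins for the diamonds example.

```r
library(data.table)

# Hypothetical toy data standing in for diamonds (column names are made up)
toy <- data.table(
  grade = factor(c("Fair", "Good", "Good", "Ideal", "Ideal", "Fair"),
                 levels = c("Fair", "Good", "Ideal")),
  cost  = c(10, 20, 40, 30, 50, 70)
)

# Hypothetical helper: column names arrive as strings, so nothing is hard-coded
group_mean <- function(dt, group_col, value_col, keep_levels) {
  dt[get(group_col) %in% keep_levels,                # row filter on a column named in a string
     .(AVE = mean(get(value_col)), COUNT = .N),      # aggregate a column named in a string
     by = group_col]                                 # by accepts a character vector of names
}

res <- group_mean(toy, "grade", "cost", keep_levels = c("Good", "Ideal"))
print(res)
```

One caveat of this pattern: get() and by = group_col rely on the table not having a column literally named group_col or value_col.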

Note (thanks to the reply below): if you need to change a variable's values by reference, wrap the variable holding the column name in parentheses, (), rather than using get(), as shown below:

strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
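To make the role of the parentheses concrete, here is a small sketch on a made-up table (all names are hypothetical). The parenthesised variable on the left of := is looked up and its string value is used as the target column name, while get() is still used on the right-hand side to read a column.

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(grade = c("a", "a", "b", "b"), cost = c(1, 3, 10, 30))

group_col  <- "grade"
value_col  <- "cost"
target_col <- "cost_mean"   # name of the column to create or overwrite

# (target_col) := ... assigns to the column named by the string in target_col;
# the table is modified by reference, so no reassignment is needed
toy[, (target_col) := mean(get(value_col)), by = group_col]

print(toy)
```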

Now you can cut and paste the following code for all your looping needs:

for(nVarGroup in 2:4)       # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction  <- quote(mean(get(strVarMeans))) 
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions) 
    p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]

    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }

My first question:
The code above, while meeting the required function/looping needs, appears quite convoluted. It combines several different (possibly inconsistent) non-intuitive tricks: (), get(), quote()/eval(), and [[]]. That seems like too many for such a straightforward need.

Is there a better way of accessing and modifying data.table values in loops? Perhaps with on=, lapply/.SD/.SDcols?

Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: "How can one work fully generically in data.table in R with column names in variables"). Eventually, it would be great to create a dedicated vignette on using data.table within functions and loops.

The second question:
Is dplyr easier for this purpose? For this question, however, I've set up a separate post: "Is dplyr easier than data.table to be used within functions and loops?".

Recommended answer

This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup, 
            .SDcols = strVarMeans]

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

I've left the old code in as comments for reference.

qAction is replaced by lapply(.SD, mean) together with the .SDcols parameter.

qGroup for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
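As an illustration of these two replacements, the sketch below averages several columns in one call via .SDcols, and selects rows by passing the wanted values as i. It deviates from the loop above in two ways that are my own assumptions: it uses a made-up toy table with hypothetical column names, and it uses an ad hoc on= join instead of setkeyv(), which avoids reordering the table by reference.

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(
  grade = c("Fair", "Good", "Good", "Ideal", "Ideal", "Fair"),
  cost  = c(10, 20, 40, 30, 50, 70),
  size  = c(1, 2, 4, 3, 5, 7)
)

group_col   <- "grade"
value_cols  <- c("cost", "size")      # several columns handled in one call
keep_levels <- c("Good", "Ideal")

res <- toy[.(keep_levels),                                   # rows whose group matches these values
           c(lapply(.SD, mean), list(COUNT = .N)),           # one mean per column in .SDcols
           by = group_col, .SDcols = value_cols, on = group_col]

print(res)
```

Combining lapply(.SD, mean) with list(COUNT = .N) through c() yields one ordinary column per averaged variable rather than a list column.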

For a more complex subsetting expression, I would try to use non-equi (or conditional) joins with the on= syntax.
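A rough sketch of what such a non-equi join could look like with a column name held in a variable follows. The toy table, the bounds table, and the range condition are hypothetical and not taken from the answer; the join conditions are built as strings of the form "col>=lo", which on= accepts.

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(grade = c("a", "a", "b", "b"), cost = c(5, 15, 25, 35))

value_col <- "cost"
bounds    <- data.table(lo = 10, hi = 30)   # keep rows with lo <= value <= hi

# Build the join conditions from the column name held in value_col
cond <- c(sprintf("%s>=lo", value_col), sprintf("%s<=hi", value_col))

# which = TRUE returns the matching row numbers of toy instead of joined columns
idx <- toy[bounds, on = cond, which = TRUE, nomatch = NULL]
res <- toy[sort(idx)]

print(res)
```

Asking for row numbers sidesteps a common surprise with non-equi joins, where the join columns in the result carry the values from the i table rather than the original ones.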

Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
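In its plainest form, and only as a sketch with base R's sprintf() on a made-up table (names are hypothetical), the idea is to assemble the whole statement as text, then parse and evaluate it:

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(grade = c("a", "a", "b", "b"), cost = c(1, 3, 10, 30))

group_col <- "grade"
value_col <- "cost"

# Assemble the complete data.table call as a string, like a dynamic SQL statement
stmt <- sprintf("toy[, .(AVE = mean(%s), COUNT = .N), by = %s]",
                value_col, group_col)
print(stmt)

# Parse the text into an expression and evaluate it in the current environment
res <- eval(parse(text = stmt))
print(res)
```

As with dynamic SQL, this should only be fed column names you control, since any text placed in the string is executed as code.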



Matt suggested creating a helper function

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution", as suggested by G. Grothendieck.

With this, the code for the loop eventually becomes:

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of variables are referenced.

With version 1.1.0 (CRAN release on 2016-08-19), the stringr package gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.

With str_interp(), the central statement in the for loop would become

p <- EVAL(stringr::str_interp(
  "dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
  ))

and the call to library(gsubfn) can be removed.
