How to use data.table within functions and loops?

Views: 127. Posted: 2017/7/13 20:12:05. Tags: r, function, loops, data.table, dplyr


Question

While assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.

For this, I've modified the code snippet used in this post: "data.table vs dplyr: can one do something well the other can't or does poorly?". Instead of hard-coded dataset variable names (the "cut" and "price" variables of the "diamonds" dataset), the code becomes dataset-agnostic: cut-and-paste ready for use inside any function or loop, when we don't know the column names in advance.

This is the original code:

library(data.table)
dt <- data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]

This is its dataset-agnostic equivalent:

dt <- data.table(ggplot2::diamonds)
nVarGroup <- 2 #"cut"
nVarMeans <- 7 #"price"

strGroupConditions <- levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal" 
strVarGroup <- names(dt)[nVarGroup]
strVarMeans <- names(dt)[nVarMeans]
qAction <- quote(mean(get(strVarMeans))) #! w/o get() it does not work! 
qGroup <- quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work! 
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]

Note (thanks to the reply below): if you need to change a variable's value by reference, wrap the variable holding the column name in parentheses, (), on the left-hand side of := instead of using get(), as shown below:

strVarToBeReplaced <- names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
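
To make the role of the parentheses concrete, here is a minimal sketch on a toy table (the table and the names target, val_mean are made up for illustration and are not from the original post). With parentheses the left-hand side of := is evaluated to get the column name; without them the symbol itself is taken as the column name.

```r
library(data.table)

# Toy data, assumed for illustration only
dt <- data.table(grp = c("a", "a", "b"), val = c(1, 3, 10))

target <- "val_mean"  # name of the column to create, held in a variable

# With parentheses: `target` is evaluated, so the new column is "val_mean"
dt[, (target) := mean(val), by = grp]

# Without parentheses: the new column is literally named "target"
dt[, target := 0]

print(names(dt))
print(dt)
```

Both assignments modify dt in place, so no reassignment of the result is needed.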

Now you can cut and paste the following code for all your looping needs:

for(nVarGroup in 2:4)       # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    qAction  <- quote(mean(get(strVarMeans))) 
    qGroup <- quote(get(strVarGroup) %in% strGroupConditions) 
    p <- dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]

    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }

My first question:

The code above, while meeting the required function/looping needs, appears quite convoluted. It combines several different (and possibly inconsistent) non-intuitive tricks: (), get(), quote()/eval(), and [[]]. That seems like too many for such a straightforward need...

Is there a better way of accessing and modifying data.table values in loops? Perhaps with on=, or lapply/.SD/.SDcols?

Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: "How can one work fully generically in data.table in R with column names in variables"). Eventually, it would be great to create a dedicated vignette on using data.table within functions and loops.

The second question:

Is dplyr easier for this purpose? For this question, however, I've set up a separate post: "Is dplyr easier than data.table to be used within functions and loops?".

Recommended answer

This might not be the most data.table-like or the fastest solution, but I would streamline the code in this particular loop as follows:

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup, 
            .SDcols = strVarMeans]

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

I've left the old code as comments for reference.
qAction is replaced by lapply(.SD, mean) together with the .SDcols parameter.
qGroup, which subsets rows, is replaced by setting a key and providing the vector of desired values as the i parameter.
For a more complex subsetting expression I would try non-equi (or conditional) joins using the on= syntax.
Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".

Matt suggested creating a helper function
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution", as suggested by G. Grothendieck.
With this, the code for the loop eventually becomes:
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions = levels(dt[[nVarGroup]])[-1] 
    strVarGroup = names(dt)[nVarGroup]
    strVarMeans = names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

Now the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of variables are referenced.
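
The build-a-string-then-evaluate idea can also be tried without any interpolation package. The following is a minimal base-R sketch on a toy table. EVAL1 is a hypothetical variant of the helper, not the one from the answer: it uses parent.frame() rather than parent.frame(2), on my assumption that a direct call should evaluate in the frame of its immediate caller. The table and column names are made up for illustration.

```r
library(data.table)

# Hypothetical variant of EVAL for direct calls (assumption: evaluate in
# the immediate caller's frame, hence parent.frame() instead of parent.frame(2))
EVAL1 <- function(...) eval(parse(text = paste0(...)), envir = parent.frame())

# Toy data, assumed for illustration only
dt <- data.table(grp = c("a", "a", "b", "b"), val = c(1, 3, 10, 20))
strVarGroup <- "grp"
strVarMeans <- "val"

# Assemble the statement as text first, so it can be inspected before running
stmt <- paste0("dt[, .(AVE = mean(", strVarMeans, "), COUNT = .N), by = ",
               strVarGroup, "]")
print(stmt)

# Parse and evaluate the assembled statement
p <- EVAL1(stmt)
print(p)
```

Printing the assembled string before evaluating it is a cheap way to debug this style. Column names that are not syntactically valid R names would need backticks in the string, and text from untrusted sources should never be passed to such a helper.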
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package gained a string interpolation function, str_interp(), which is an alternative to the gsubfn package here.
With str_interp(), the central statement in the for loop would become
p <- EVAL(stringr::str_interp(
  "dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
  ))

and the call to library(gsubfn) can be removed.
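
Since the question asks about functions as well as loops, the lapply(.SD, mean) / .SDcols approach can also be wrapped in a function that takes column names as character arguments. This is only a sketch under stated assumptions: group_means and its arguments are hypothetical names, and the toy table stands in for the diamonds data.

```r
library(data.table)

# Hypothetical helper (not from the original answer): per-group means of the
# columns named in `mean_cols`, optionally keeping only the groups in `keep`.
group_means <- function(dt, group_col, mean_cols, keep = NULL) {
  stopifnot(is.data.table(dt),
            group_col %in% names(dt),
            all(mean_cols %in% names(dt)))
  # Row filter built from the column name held in `group_col`
  sel <- if (is.null(keep)) seq_len(nrow(dt)) else which(dt[[group_col]] %in% keep)
  sub <- dt[sel]
  # .SDcols selects the columns to average; `by` accepts the character variable
  sub[, c(lapply(.SD, mean), list(COUNT = .N)), by = group_col, .SDcols = mean_cols]
}

# Toy data, assumed for illustration only
dt <- data.table(grp = c("a", "a", "b", "b", "c"),
                 x   = c(1, 3, 10, 20, 100),
                 y   = c(2, 4, 6, 8, 10))

res <- group_means(dt, "grp", c("x", "y"), keep = c("a", "b"))
print(res)
```

Passing several names in mean_cols computes all means in one call, which can replace the inner loop over numeric columns when one result table per grouping column is enough.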
