How to use data.table within functions and loops?


Problem description


When assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops.
For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't or does poorly? Instead of hard-coding the dataset's variable names (the "cut" and "price" variables of the "diamonds" dataset), the code becomes dataset-agnostic: cut-and-paste ready for use inside any function or loop where the column names are not known in advance.

This is the original code:

library(data.table)
dt = data.table(ggplot2::diamonds)
dt[cut != "Fair", .(mean(price),.N), by = cut]  

This is its dataset-agnostic equivalent:

dt = data.table(ggplot2::diamonds)
nVarGroup = 2 #"cut"
nVarMeans = 7 #"price"

strGroupConditions = levels(dt[[nVarGroup]])[-1] # "Good" "Very Good" "Premium" "Ideal" 
strVarGroup = names(dt)[nVarGroup]
strVarMeans = names(dt)[nVarMeans]
qAction=quote(mean(get(strVarMeans))) #! w/o get() it does not work! 
qGroup=quote(get(strVarGroup) %in% strGroupConditions) #! w/o get() it does not work! 
dt[eval(qGroup), .(eval(qAction), .N), by = strVarGroup]

Note (thanks to the reply below): if you need to change a variable's value by reference, wrap the name of the target column in parentheses, (), on the left-hand side of := rather than using get(), as shown below:

strVarToBeReplaced = names(dt)[1]
dt[eval(qGroup), (strVarToBeReplaced) := eval(qAction), by = strVarGroup][]
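To isolate the by-reference idiom from the rest of the example, here is a minimal sketch on a toy table. The table `toy` and the variables `strTarget`, `strSource`, and `strGroup` are made up for illustration and are not part of the original post.

```r
library(data.table)

# Hypothetical toy data; column names are held in character variables.
toy <- data.table(grp = c("a", "a", "b"), val = c(1, 3, 10), out = c(0, 0, 0))
strTarget <- "out"   # column to overwrite by reference
strSource <- "val"   # column to average
strGroup  <- "grp"   # grouping column

# (strTarget) on the left of := means "the column whose name is stored in strTarget".
# Without the parentheses, a column literally named "strTarget" would be created.
toy[, (strTarget) := mean(get(strSource)), by = strGroup]
print(toy)
```

The parentheses only affect how the left-hand side of `:=` is interpreted; on the right-hand side, `get()` is still one way to look a column up by name.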

Now: you can cut-n-paste it for all your looping needs, as in here:

for(nVarGroup in 2:4)       # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions = levels(dt[[nVarGroup]])[-1] 
    strVarGroup = names(dt)[nVarGroup]
    strVarMeans = names(dt)[nVarMeans]
    qAction=quote(mean(get(strVarMeans))) 
    qGroup=quote(get(strVarGroup) %in% strGroupConditions) 
    p = dt[eval(qGroup), .(AVE=eval(qAction), COUNT=.N), by = strVarGroup]

    print(sprintf("nVarGroup=%s, nVarMeans=%s: ", strVarGroup, strVarMeans))
    print(p)
  }

My first question:
The code above, while meeting the required function/looping needs, appears quite convoluted. It combines several different (and possibly inconsistent) non-intuitive tricks: (), get(), quote()/eval(), and [[]]. That seems like too many for such a straightforward need...

Is there a better way of accessing and modifying data.table values in loops? Perhaps with on=, with=FALSE, or lapply/.SD/.SDcols?

Please share your ideas below. This discussion aims to supplement and consolidate related bits from other posts (such as those listed here: How can one work fully generically in data.table in R with column names in variables). Eventually, it would be great to create a dedicated vignette on using data.table within functions and loops.

The second question:
Is dplyr easier for this purpose? For this question, however, I've created a separate post: Is dplyr easier than data.table to be used within functions and loops?

Solution

This might not be the most data.table-like or the fastest solution but I would streamline the code in this particular loop as follows:

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions <- levels(dt[[nVarGroup]])[-1] 
    strVarGroup <- names(dt)[nVarGroup]
    strVarMeans <- names(dt)[nVarMeans]
    # qAction <- quote(mean(get(strVarMeans)))
    # qGroup <- quote(get(strVarGroup) %in% strGroupConditions)
    # p <- dt[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
    setkeyv(dt, strVarGroup)
    p <- dt[strGroupConditions, .(AVE = lapply(.SD, mean), COUNT = .N), by = strVarGroup, 
            .SDcols = strVarMeans]

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

I've left the old code as comments for reference.

qAction is replaced by using lapply(.SD, mean) together with the .SDcols parameter.

qGroup for subsetting rows is replaced by the combination of setting a key and providing the vector of desired values as the i parameter.
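The same two replacements can be packaged in a function. The sketch below is a variation, not the code from the answer: it selects rows with a logical vector built via `[[ ]]` instead of setting a key, and it returns the mean as an ordinary column by combining `lapply(.SD, mean)` with `.N` through `c()`. The function name `group_means`, its arguments, and the toy data are assumptions made for illustration.

```r
library(data.table)

# Hypothetical helper: column names arrive as character strings.
group_means <- function(dt, group_col, mean_col, keep) {
  # Logical row selector built outside of [.data.table, so no get()/eval() is needed.
  # Assumes dt has no column that is itself named "rows" or "group_col".
  rows <- dt[[group_col]] %in% keep
  dt[rows,
     c(lapply(.SD, mean), list(COUNT = .N)),  # .SD holds only the columns in .SDcols
     by = group_col,
     .SDcols = mean_col]
}

toy <- data.table(grp = c("a", "a", "b", "b", "c"),
                  val = c(1, 3, 10, 20, 100))
res <- group_means(toy, "grp", "val", keep = c("a", "b"))
print(res)
```

Because the row selector is a plain logical vector, the function leaves the key and row order of the input table untouched, which may matter when the same table is reused across loop iterations.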


In the case of a more complex subsetting expression, I would try to use non-equi (or conditional) joins with the on= syntax.
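As a rough illustration of what such a non-equi join could look like when the column name is only known as a string, the sketch below builds the `on=` conditions as character strings with `sprintf()`. The tables `toy` and `bounds` and the variable `strVar` are hypothetical, and the range filter is just an example of a "more complex" condition.

```r
library(data.table)

toy <- data.table(id = 1:5, price = c(5, 12, 18, 25, 40))
bounds <- data.table(lo = 10, hi = 30)   # hypothetical range to filter on
strVar <- "price"                        # column name known only at run time

# on= accepts conditions as strings, e.g. "price>=lo", so they can be assembled dynamically.
cond <- c(sprintf("%s>=lo", strVar), sprintf("%s<=hi", strVar))

# which = TRUE returns the matching row numbers of toy; sort() makes the order explicit.
idx <- sort(toy[bounds, on = cond, which = TRUE])
res <- toy[idx]
print(res)
```

Fetching row numbers first and then subsetting sidesteps the question of how the join columns are named in the result of a non-equi join.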

Or, follow Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".

Matt suggested creating a helper function

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

which can be combined with the "quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution" as suggested by G. Grothendieck.

With this, the code for the loop finally becomes:

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
library(gsubfn)

for(nVarGroup in 2:4) {      # Grouped by several categorical values...
  for(nVarMeans in 5:10) {  # ... get means of all numerical parameters
    strGroupConditions = levels(dt[[nVarGroup]])[-1] 
    strVarGroup = names(dt)[nVarGroup]
    strVarMeans = names(dt)[nVarMeans]
    p <- fn$EVAL("dt[$strVarGroup %in% strGroupConditions, .(AVE=mean($strVarMeans), COUNT=.N), by = strVarGroup]" )

    print(sprintf("nVarGroup = %s, nVarMeans = %s", strVarGroup, strVarMeans))
    print(p)
  }
}

Now, the data.table statement looks pretty much like a "native" statement, except that $strVarGroup and $strVarMeans are used where the contents of the variables are referenced.


With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function str_interp() which is an alternative to the gsubfn package here.

With str_interp(), the central statement in the for loop would become

p <- EVAL(stringr::str_interp(
  "dt[${strVarGroup} %in% strGroupConditions, .(AVE=mean(${strVarMeans}), COUNT=.N), by = strVarGroup]"
  ))

and the call to library(gsubfn) could be removed.
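A related option that avoids parsing text altogether is to build the expressions with base R's `bquote()` and `as.name()`, then evaluate them inside the data.table call in the same way as the `quote()`/`eval()` version in the question. This is a sketch of that alternative rather than something proposed in the answer above, and the toy table is made up for illustration.

```r
library(data.table)

toy <- data.table(grp = c("a", "a", "b", "b", "c"),
                  val = c(1, 3, 10, 20, 100))
strVarGroup <- "grp"
strVarMeans <- "val"
strGroupConditions <- c("a", "b")

# as.name() turns a string into a symbol; bquote() splices it in wherever .() appears.
qGroup  <- bquote(.(as.name(strVarGroup)) %in% .(strGroupConditions))  # grp %in% c("a", "b")
qAction <- bquote(mean(.(as.name(strVarMeans))))                       # mean(val)

res <- toy[eval(qGroup), .(AVE = eval(qAction), COUNT = .N), by = strVarGroup]
print(res)
```

Note that `.()` means "substitute here" inside `bquote()` but is an alias for `list()` inside a data.table call, so the data.table call itself is kept outside of `bquote()` in this sketch.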
