Create an expression from a function for data.table to eval

Problem description

Given the data.table:

dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 

I'd like a function that creates an expression between two like-named columns given their "root" name, e.g. x_one - x_two.

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')

  parse(text=paste(one, '-', two))
}

Now, using just one root name works as expected and returns a vector.

dat[, eval(myfun('x')),]

[1] 0 0 0 0 0 0 0 0 0 0

However, trying to give that output a name using the list technique fails:

dat[, list(x_out = eval(myfun('x'))),]

Error in eval(expr, envir, enclos) : object 'x_one' not found

I can "solve" this by adding a with(dat, ...), but that hardly seems data.table-ish:

dat[, list(x_out = with(dat, eval(myfun('x'))),
           y_out = with(dat, eval(myfun('y')))),]

    x_out y_out
 1:     0     0
 2:     0     0
 3:     0     0
 4:     0     0
 5:     0     0
 6:     0     0
 7:     0     0
 8:     0     0
 9:     0     0
10:     0     0

What is the proper way to generate and evaluate these expressions if I want output like the above?

In case it helps, sessionInfo() output is below. I recall being able to do this, or something close to it, but it's been a while and data.table has been updated since...

R version 2.15.1 (2012-06-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices utils     datasets  stats     grid      methods   base     

other attached packages:
 [1] Cairo_1.5-1      zoo_1.7-7        stringr_0.6.1    doMC_1.2.5       multicore_0.1-7  iterators_1.0.6  foreach_1.4.0   
 [8] data.table_1.8.2 circular_0.4-3   boot_1.3-5       ggplot2_0.9.1    reshape2_1.2.1   plyr_1.7.1      

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       labeling_0.1       lattice_0.20-6    
 [7] MASS_7.3-20        memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5 scales_0.2.1      
[13] tools_2.15.1      

Recommended answer

One solution is to put the list(...) within the function output.

I tend to use as.quoted, stealing from the way @hadley implements .() in the plyr package.

library(data.table)
library(plyr)
dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 
myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name,'_out')
  as.quoted(paste('list(', out, '=', one, '-', two, ')'))[[1]]
}


dat[, eval(myfun('x')),]

#    x_out
# 1:     0
# 2:     0
# 3:     0
# 4:     0
# 5:     0
# 6:     0
# 7:     0
# 8:     0
# 9:     0
#10:     0

To do two columns at once, you can adjust the function so that it accepts a vector of names:

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name,'_out')
  calls <- paste(paste(out, '=', one, '-', two), collapse = ',')

  as.quoted(paste('list(', calls, ')'))[[1]]
}


dat[, eval(myfun(c('x','y'))),]

#    x_out y_out
# 1:     0     0
# 2:     0     0
# 3:     0     0
# 4:     0     0
# 5:     0     0
# 6:     0     0
# 7:     0     0
# 8:     0     0
# 9:     0     0
#10:     0     0
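
For comparison, the same j call can be assembled from language objects instead of pasted strings. This is a minimal base-R sketch, not the method used in the answer; build_j is a hypothetical helper name, and it assumes the _one/_two/_out naming convention from the question.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# build_j is a hypothetical helper: it returns an unevaluated call such as
# list(x_out = x_one - x_two, y_out = y_one - y_two)
build_j <- function(names) {
  exprs <- lapply(names, function(nm) {
    call("-", as.name(paste0(nm, "_one")), as.name(paste0(nm, "_two")))
  })
  names(exprs) <- paste0(names, "_out")
  as.call(c(as.name("list"), exprs))
}

j_call <- build_j(c("x", "y"))
print(j_call)

# j starts with eval, so data.table evaluates the call in the table's scope
res <- dat[, eval(j_call)]
print(res)
```

Because no strings are parsed, this variant avoids the plyr dependency, at the cost of being a little more verbose.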

As for the reason why:

In this solution, the entire call to list(...) is evaluated with the data.table as the parent.frame.

The relevant code within [.data.table is

if (missing(j)) stop("logical error, j missing")
jsub = substitute(j)
if (is.null(jsub)) return(NULL)
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]],quote(eval))) {
    jsub = eval(jsubl[[2L]],parent.frame())
    if (is.expression(jsub)) jsub = jsub[[1L]]
}

If (in your case)

j = list(xout = eval(myfun('x'))) 

##then

jsub <- substitute(j) 

 #  list(xout = eval(myfun("x")))

as.list.default(jsub)
## [[1]]
## list
## 
## $xout
## eval(myfun("x"))

So jsubl[[1L]] is list and jsubl[[2L]] is eval(myfun("x")). data.table therefore has not found a call to eval at the start of j and will not deal with it appropriately.
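
The check can be reproduced outside data.table. The sketch below uses inspect_j, a hypothetical stand-in for the first few lines of [.data.table quoted above; it only looks at the unevaluated j, so myfun never needs to exist.

```r
# inspect_j is a hypothetical helper that mimics the check quoted above:
# it captures j unevaluated and reports whether its first element is `eval`
inspect_j <- function(j) {
  jsub <- substitute(j)
  jsubl <- as.list(jsub)
  identical(jsubl[[1L]], quote(eval))
}

# j starts with eval: the special case is triggered
starts_with_eval <- inspect_j(eval(myfun("x")))

# eval is nested inside list(...): the special case is not triggered
nested_eval <- inspect_j(list(x_out = eval(myfun("x"))))

print(starts_with_eval)
print(nested_eval)
```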

This will work, forcing the second evaluation to happen within the correct data.table:

# using OP myfun
dat[,list(xout =eval(myfun('x'), dat))]

In the same way,

eval(parse(text = 'x_one'),dat)
# [1]  1  2  3  4  5  6  7  8  9 10

works, but

 eval(eval(parse(text = 'x_one')), dat)

does not.

It is probably safer (but slower), though, to use .SD as the environment, as it will then be robust to i or by as well, e.g.

dat[,list(xout =eval(myfun('x'), .SD))]
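
A small sketch of why .SD is the more robust environment, using the question's myfun and a row filter in i chosen here purely for illustration: .SD contains only the rows selected by i, so the result has the expected length.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# myfun as defined in the question: returns an expression such as x_one - x_two
myfun <- function(name) {
  one <- paste0(name, "_one")
  two <- paste0(name, "_two")
  parse(text = paste(one, "-", two))
}

# The filter x_one > 5 is an illustrative assumption.
# .SD holds only the rows that pass the filter in i.
sub <- dat[x_one > 5, list(xout = eval(myfun("x"), .SD))]
print(sub)
```

Evaluating against dat instead of .SD would ignore the filter, because dat always refers to the full table.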

Matthew

+10 to above. I couldn't have explained it better myself. Taking it a step further, what I sometimes do is construct the entire data.table query and then eval that. It can sometimes be a bit more robust that way. I think of it like SQL; i.e., we often construct a dynamic SQL statement that is sent to the SQL server to be executed. When you are debugging, it's also sometimes easier to look at the constructed query and run that at the browser prompt. But sometimes such a query would be very long, so passing eval into i, j or by can be more efficient by not recomputing the other components. As usual, there are many ways to skin the cat.
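
The "construct the entire query, then eval it" idea might look like the following. This is a minimal sketch under the question's naming convention; build_query is a hypothetical helper, and the table name dat is hard-coded into the string as an assumption.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# build_query is a hypothetical helper that returns the whole query as text
build_query <- function(names) {
  cols <- paste0(names, "_out = ", names, "_one - ", names, "_two",
                 collapse = ", ")
  paste0("dat[, list(", cols, ")]")
}

query <- build_query(c("x", "y"))
print(query)  # the constructed query can be inspected or pasted at a prompt

res <- eval(parse(text = query))
print(res)
```

As with dynamic SQL, the string should only be built from trusted input, since it is executed as code.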

The subtle reasons for considering evaling the entire query include:

  1. One reason grouping is fast is that it inspects the j expression first. If it's a list, it removes the names but remembers them. It then evals an unnamed list for each group and reinstates the names once, at the end, on the final result. One reason other methods can be slow is that they recreate the same column name vector for each and every group, over and over again. The more complex j is, though (e.g. if the expression doesn't start precisely with list), the harder it gets to code up the inspection logic internally. There are lots of tests in this area, e.g. in combination with eval, and verbosity reports if name dropping isn't working. But constructing a "simple" query (the full query) and evaling that may be faster and more robust for this reason.

With v1.8.2 there's now optimization of j: options(datatable.optimize=Inf). So far, this inspects j and modifies it to optimize mean and the lapply(.SD,...) idiom. This makes orders of magnitude difference and means there's less for the user to need to know (e.g. a few of the wiki points have gone away now). We could do more of this; e.g., DT[a==10] could be optimized to DT[J(10)] automatically if key(DT)[1]=="a" [Update Sep 2014 - now implemented in v1.9.3]. But again, these optimizations get harder to code up internally if, rather than DT[,mean(a),by=b], it's DT[,list(x=eval(expr)),by=b] where expr contains a call to mean, for example. So evaling the entire query may play nicer with datatable.optimize. Turning verbosity on reports what it's doing, and optimization can be turned off if needed, e.g. to test the speed difference it makes.

As per comments, FR#2183 has been added: "Change j=list(xout=eval(...))'s eval to eval within scope of DT". Thanks for highlighting. That's the sort of complex j I mean, where the eval is nested in the expression. If j starts with eval, though, that's much simpler and already coded (as shown above) and tested, and should be optimized fine.

If there's one take-away from this, it's: do use DT[...,verbose=TRUE] or options(datatable.verbose=TRUE) to check that data.table is still working efficiently when used for dynamic queries involving eval.
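
A short sketch of that check, assuming a pre-built j call (j_call is a name introduced here for illustration). The verbose messages themselves differ between data.table versions, so none are reproduced.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# j_call is an illustrative, pre-built j expression
j_call <- quote(list(x_out = x_one - x_two))

# verbose = TRUE prints what data.table does with j for this one call
res <- dat[, eval(j_call), verbose = TRUE]

# Alternatively, switch verbosity on for every query in the session:
# options(datatable.verbose = TRUE)
```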
