Create an expression from a function for data.table to eval


Problem description

Given the data.table dat:

dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 

I'd like a function that creates an expression between two like rows given their "root" name, e.g. x_one - x_two.

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')

  parse(text=paste(one, '-', two))
}

Now, using just one root name works as expected and results in a vector.

dat[, eval(myfun('x')),]

[1] 0 0 0 0 0 0 0 0 0 0

However, trying to assign that output a name using the list technique fails:

dat[, list(x_out = eval(myfun('x'))),]

Error in eval(expr, envir, enclos) : object 'x_one' not found

I can "solve" this by adding a with(dat, ...), but that hardly seems data.table-ish:

dat[, list(x_out = with(dat, eval(myfun('x'))),
           y_out = with(dat, eval(myfun('y')))),]

    x_out y_out
 1:     0     0
 2:     0     0
 3:     0     0
 4:     0     0
 5:     0     0
 6:     0     0
 7:     0     0
 8:     0     0
 9:     0     0
10:     0     0

What is the proper way to generate and evaluate these expressions if I want an output like I have above?

In case it helps, sessionInfo() output is below. I recall being able to do this, or something close to it, but it's been a while and data.table has been updated since...

R version 2.15.1 (2012-06-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices utils     datasets  stats     grid      methods   base     

other attached packages:
 [1] Cairo_1.5-1      zoo_1.7-7        stringr_0.6.1    doMC_1.2.5       multicore_0.1-7  iterators_1.0.6  foreach_1.4.0   
 [8] data.table_1.8.2 circular_0.4-3   boot_1.3-5       ggplot2_0.9.1    reshape2_1.2.1   plyr_1.7.1      

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       labeling_0.1       lattice_0.20-6    
 [7] MASS_7.3-20        memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5 scales_0.2.1      
[13] tools_2.15.1      

Solution

One solution is to put the list(...) within the function output.

I tend to use as.quoted, stealing from the way @hadley implements .() in the plyr package.

library(data.table)
library(plyr)
dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 
myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name,'_out')
  as.quoted(paste('list(', out, '=', one, '-', two, ')'))[[1]]
}


dat[, eval(myfun('x')),]

#    x_out
# 1:     0
# 2:     0
# 3:     0
# 4:     0
# 5:     0
# 6:     0
# 7:     0
# 8:     0
# 9:     0
#10:     0

To do two columns at once you can adjust your call:

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name,'_out')
  calls <- paste(paste(out, '=', one, '-',two), collapse = ',')


  as.quoted(paste('list(', calls, ')'))[[1]]
}


dat[, eval(myfun(c('x','y'))),]

#   x_out y_out
# 1:     0     0
# 2:     0     0
# 3:     0     0
# 4:     0     0
# 5:     0     0
# 6:     0     0
# 7:     0     0
# 8:     0     0
# 9:     0     0
#10:     0     0
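
The same list(...) call can also be assembled from language objects instead of pasted strings, which avoids the plyr dependency. This is only a sketch of an alternative, not the approach used in the answer above; make_diff_call is a hypothetical helper name, and the column naming scheme (<root>_one, <root>_two, <root>_out) is taken from the question.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# Hypothetical helper: builds list(<root>_out = <root>_one - <root>_two, ...)
# as an unevaluated call, one named element per root.
make_diff_call <- function(roots) {
  diffs <- lapply(roots, function(r) {
    call("-", as.name(paste0(r, "_one")), as.name(paste0(r, "_two")))
  })
  names(diffs) <- paste0(roots, "_out")
  as.call(c(as.name("list"), diffs))
}

j_expr <- make_diff_call(c("x", "y"))
print(j_expr)

# j starts with eval, so data.table evaluates the call against its own columns
res <- dat[, eval(j_expr)]
```

Because j still begins with eval, this relies on the same mechanism as the as.quoted version; only the way the call is constructed differs.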

As for the reason:

In this solution the entire call to list(...) is evaluated within the scope of the data.table.

The relevant code within [.data.table is

if (missing(j)) stop("logical error, j missing")
jsub = substitute(j)
if (is.null(jsub)) return(NULL)
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]],quote(eval))) {
    jsub = eval(jsubl[[2L]],parent.frame())
    if (is.expression(jsub)) jsub = jsub[[1L]]
}

If (in your case)

j = list(xout = eval(myfun('x'))) 

##then

jsub <- substitute(j) 

is

 #  list(xout = eval(myfun("x")))

and

as.list.default(jsub)
## [[1]]
## list
## 
## $xout
## eval(myfun("x"))

So jsubl[[1L]] is list and jsubl[[2L]] is eval(myfun("x")).

Therefore data.table has not found a call to eval as the first element of j, and will not deal with it appropriately.
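
The check described above can be reproduced outside data.table in base R. The sketch below uses a hypothetical function, starts_with_eval, that mimics only the first-element test from the quoted source; it is not data.table's actual code path.

```r
# Hypothetical stand-in for the test that [.data.table applies to j.
# substitute() captures the unevaluated argument, so myfun is never called
# and does not need to exist here.
starts_with_eval <- function(j) {
  jsub  <- substitute(j)
  jsubl <- as.list(jsub)
  identical(jsubl[[1L]], quote(eval))
}

direct <- starts_with_eval(eval(myfun("x")))               # j is eval(...)
nested <- starts_with_eval(list(xout = eval(myfun("x"))))  # j is list(...)

print(c(direct = direct, nested = nested))
```

Only the first form is recognised, which is why wrapping eval inside list(...) changed the behaviour in the question.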

This will work, forcing the second evaluation to happen within the correct data.table:

# using OP myfun
dat[,list(xout =eval(myfun('x'), dat))]

In the same way,

eval(parse(text = 'x_one'),dat)
# [1]  1  2  3  4  5  6  7  8  9 10

works, but

eval(eval(parse(text = 'x_one')), dat)

does not.
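
The difference comes from where the inner eval runs: it is evaluated first, in the calling frame, before the outer eval ever sees the data. A minimal base R sketch, using a plain data.frame named df as a stand-in for dat and assuming no object called x_one exists in the calling environment:

```r
df <- data.frame(x_one = 1:10)
expr <- parse(text = "x_one")

# One eval: the expression is looked up inside df, so x_one is found
ok <- eval(expr, df)

# Nested eval: the inner eval(expr) runs in the calling frame, where there
# is no x_one, so it fails before df is consulted
msg <- tryCatch(eval(eval(expr), df), error = function(e) conditionMessage(e))

print(ok)
print(msg)
```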

Edit 10/4/13

It is probably safer (but slower) to use .SD as the environment, as it will then be robust to i and by as well, e.g.

dat[,list(xout =eval(myfun('x'), .SD))]
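
To see why .SD matters once by is involved, the sketch below adds a hypothetical grouping column g and different values for x_two (neither is in the original data). Passing dat as the environment always sees all rows, whereas .SD sees only the rows of the current group.

```r
library(data.table)

# Assumed example data: g is a made-up grouping column
dat <- data.table(g = rep(c("a", "b"), each = 5), x_one = 1:10, x_two = 10:1)

myfun <- function(name) {
  parse(text = paste(paste0(name, "_one"), "-", paste0(name, "_two")))
}

# Evaluated against the whole table for every group
whole <- dat[, list(xout = sum(eval(myfun("x"), dat))), by = g]

# Evaluated against the current group's subset only
per_group <- dat[, list(xout = sum(eval(myfun("x"), .SD))), by = g]

print(whole)
print(per_group)
```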


Edit from Matthew:

+10 to above. I couldn't have explained it better myself. Taking it a step further, what I sometimes do is construct the entire data.table query and then eval that. It can be a bit more robust that way, sometimes. I think of it like SQL; i.e., we often construct a dynamic SQL statement that is sent to the SQL server to be executed. When you are debugging, it's also sometimes easier to look at the constructed query and run that at the browser prompt. But sometimes such a query would be very long, so passing eval into i, j or by can be more efficient by not recomputing the other components. As usual, many ways to skin the cat.
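
A minimal sketch of the "build the whole query, then eval it" idea, in the spirit of the dynamic SQL analogy. build_query is a hypothetical helper, and the table name dat is hard-coded into the query string as an assumption of this example.

```r
library(data.table)

dat <- data.table(x_one = 1:10, x_two = 1:10, y_one = 1:10, y_two = 1:10)

# Hypothetical helper: returns the full query as text so it can be inspected
build_query <- function(roots) {
  cols <- paste0(roots, "_out = ", roots, "_one - ", roots, "_two",
                 collapse = ", ")
  paste0("dat[, list(", cols, ")]")
}

query <- build_query(c("x", "y"))
cat(query, "\n")   # look at the constructed query before running it

res <- eval(parse(text = query))
```

Printing the query first is what makes this style convenient when debugging: the string can be pasted at the prompt and run as is.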

The subtle reasons for considering evaling the entire query include:

  1. One reason grouping is fast is that it inspects the j expression first. If it's a list, it removes the names, but remembers them. It then evals an unnamed list for each group, then reinstates the names once, at the end on the final result. One reason other methods can be slow is the recreation of the same column name vector for each and every group, over and over again. The more complex j is defined though (e.g. if the expression doesn't start precisely with list), the harder it gets to code up the inspection logic internally. There are lots of tests in this area; e.g., in combination with eval, and verbosity reports if name dropping isn't working. But, constructing a "simple" query (the full query) and evaling that may be faster and more robust for this reason.

  2. With v1.8.2 there's now optimization of j: options(datatable.optimize=Inf). This inspects j and modifies it to optimize mean and the lapply(.SD,...) idiom, so far. This makes orders of magnitude difference and means there's less for the user to need to know (e.g. a few of the wiki points have gone away now). We could do more of this; e.g., DT[a==10] could be optimized to DT[J(10)] automatically if key(DT)[1]=="a" [Update Sep 2014 - now implemented in v1.9.3]. But again, the internal optimizations get harder to code up if, rather than DT[,mean(a),by=b], it's DT[,list(x=eval(expr)),by=b] where expr contains a call to mean, for example. So evaling the entire query may play nicer with datatable.optimize. Turning verbosity on reports what it's doing, and optimization can be turned off if needed; e.g., to test the speed difference it makes.

As per comments, FR#2183 has been added: "Change j=list(xout=eval(...))'s eval to eval within scope of DT". Thanks for highlighting. That's the sort of complex j I mean where the eval is nested in the expression. If j starts with eval, though, that's much simpler and already coded (as shown above) and tested, and should be optimized fine.

If there's one take-away from this then it's: do use DT[...,verbose=TRUE] or options(datatable.verbose=TRUE) to check data.table is still working efficiently when used for dynamic queries involving eval.
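
A short sketch of that check, using a made-up table DT with columns a and b. The verbose output differs between data.table versions, so none is reproduced here; the second query switches optimization off temporarily to confirm the result is unchanged.

```r
library(data.table)

# Assumed example data
DT <- data.table(a = c(1, 2, 3, 4), b = c("u", "u", "v", "v"))
expr <- quote(mean(a))   # a dynamically built j expression

# verbose = TRUE reports what data.table does internally for this query
res_default <- DT[, eval(expr), by = b, verbose = TRUE]

# Turn optimization off for comparison, then restore the previous setting
old <- options(datatable.optimize = 0L)
res_plain <- DT[, eval(expr), by = b]
options(old)

print(res_default)
```

For timing comparisons, the two queries could be wrapped in system.time() on a table large enough for the difference to show.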
