When should I use the := operator in data.table?


Question

data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?

Solution

Here is an example showing a run time of roughly 10 minutes reduced to roughly 1 second (from the NEWS on the data.table homepage). Using := is like subassigning to a data.frame, but it updates the table by reference instead of copying the entire table each time.

library(data.table)

m = matrix(1,nrow=100000,ncol=100)   # 100,000 x 100 matrix of ones
DF = as.data.frame(m)
DT = as.data.table(m)                # columns are named V1..V100

system.time(for (i in 1:1000) DF[i,1] <- i)
     user  system elapsed 
  287.062 302.627 591.984 

system.time(for (i in 1:1000) DT[i,V1:=i])
     user  system elapsed 
    1.148   0.000   1.158     ( 511 times faster )
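
The "no copy" behaviour can be checked on a small table. The following is a minimal sketch, not part of the original benchmark: the data and the names DT2, DT3, before and after are made up for illustration, and address() is the data.table helper that reports an object's memory address.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the benchmark above)
DT <- data.table(V1 = c(1, 1, 1), V2 = c(2, 2, 2))
DT2 <- DT                    # plain assignment: a second name for the same table, not a copy

before <- address(DT)        # memory address of the table before the update
DT[2, V1 := 99]              # sub-assign by reference
after <- address(DT)

identical(before, after)     # same address: the table was updated in place
DT2$V1[2] == 99              # the other name sees the change too
DT3 <- copy(DT)              # use copy() when an independent table is needed
```

Because := changes the table in place, any other name bound to the same table changes with it; copy() is the explicit way to opt out of that.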

Putting := in j like that allows more idioms:

DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
DT[,col:=NULL]       # remove a column by reference

and:

DT[,newcol:=sum(v),by=group]  # like a fast transform() by group
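
To try these idioms end to end, here is a small self-contained sketch. The table contents and the column names group, v, done and total are assumptions chosen to mirror the snippets above; the key is set on group so that DT["a", ...] can use binary search.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(group = c("a", "a", "b"), v = c(1, 2, 10), done = FALSE)
setkey(DT, group)                     # key on group enables DT["a", ...]

DT["a", done := TRUE]                 # flag only the rows of group 'a'
DT[, newcol := 42]                    # add a column by reference
DT[, newcol := NULL]                  # remove it again by reference
DT[, total := sum(v), by = group]     # per-group sum, repeated on every row of the group

print(DT)
```

None of these calls needs DT <- ... on the left-hand side; each one modifies DT directly.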

I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch etc. So for use inside for loops, there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set are that i must be row numbers (no binary search) and that you can't combine it with by. By imposing those restrictions, set can reduce the overhead dramatically.

system.time(for (i in 1:1000) set(DT,i,"V1",i))
     user  system elapsed 
    0.016   0.000   0.018
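
A common place for set is a loop over columns. The sketch below uses made-up data and a hypothetical helper variable na_rows: it replaces missing values with 0 in every column, passing plain row numbers as i, which is the form set requires.

```r
library(data.table)

# Hypothetical table with some missing values
DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# Loop over columns; i is given as row numbers, j as the column position
for (j in seq_along(DT)) {
  na_rows <- which(is.na(DT[[j]]))
  set(DT, i = na_rows, j = j, value = 0)
}

print(DT)
```

Each call updates DT by reference, like :=, but skips the argument checking done by [.data.table on every iteration.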
