When should I use the := operator in data.table?
Problem description

data.table objects now have a `:=` operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?

Solution

Here is an example showing 10 minutes reduced to 1 second (from the NEWS on the homepage). It's like subassigning to a data.frame, but it doesn't copy the entire table each time.

    library(data.table)

    m  = matrix(1, nrow = 100000, ncol = 100)
    DF = as.data.frame(m)
    DT = as.data.table(m)

    # Subassigning to a data.frame inside a loop
    system.time(for (i in 1:1000) DF[i, 1] <- i)
       user  system elapsed
    287.062 302.627 591.984

    # The same update by reference with :=
    system.time(for (i in 1:1000) DT[i, V1 := i])
       user  system elapsed
      1.148   0.000   1.158     # 511 times faster
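To make "doesn't copy the entire table" concrete, here is a minimal sketch of what updating by reference implies. The table and the names `alias` and `snapshot` are made up for illustration; the point is that a second name bound with plain assignment sees the change, while an explicit `copy()` does not.

```r
library(data.table)

# Small illustrative table (not from the original answer)
DT <- data.table(x = c(1L, 2L, 3L), y = c("a", "b", "c"))

alias    <- DT        # plain assignment: a second name for the same table, no copy
snapshot <- copy(DT)  # explicit deep copy, independent of DT

DT[, newcol := 42]    # add a column by reference
DT[2L, x := 10L]      # update a single cell by reference

names(alias)          # "x" "y" "newcol"  (alias sees the new column)
alias$x               # 1 10 3
names(snapshot)       # "x" "y"           (the copy is untouched)
snapshot$x            # 1 2 3
```

If an untouched version of the table is needed after using `:=`, take it with `copy()` beforehand.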
Putting the `:=` in `j` like that allows more idioms:

    DT["a", done := TRUE]   # binary search for group 'a' and set a flag
    DT[, newcol := 42]      # add a new column by reference (no copy of existing data)
    DT[, col := NULL]       # remove a column by reference

and:

    DT[, newcol := sum(v), by = group]   # like a fast transform() by group
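The idioms above assume a table that already has a key and columns named `group` and `v`. A small self-contained sketch, with made-up data and a hypothetical `total` column, might look like this:

```r
library(data.table)

# Illustrative data; the column names follow the idioms above
DT <- data.table(group = c("a", "a", "b", "b", "c"),
                 v     = c(1, 2, 3, 4, 5))
setkey(DT, group)                   # the key enables DT["a", ...] lookups

DT["a", done := TRUE]               # flag only the rows of group "a"
DT[, total := sum(v), by = group]   # per-group sum, repeated on every row of the group
DT[, v := NULL]                     # drop column v by reference

DT$done    # TRUE TRUE NA NA NA
DT$total   # 3 3 7 7 5
names(DT)  # "group" "done" "total"
```

Rows that the `i` expression does not select get `NA` in a newly created column, which is why `done` is `NA` outside group "a".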
I can't think of any reason to avoid `:=`, other than inside a `for` loop. Since `:=` appears inside `DT[...]`, it comes with the small overhead of the `[.data.table` method: for example, S3 dispatch and checking for the presence and type of arguments such as `i`, `by`, `nomatch`, etc. So for use inside `for` loops there is a low-overhead, direct version of `:=` called `set`. See `?set` for more details and examples. The disadvantages of `set` are that `i` must be row numbers (no binary search) and that you can't combine it with `by`. By imposing those restrictions, `set` can reduce the overhead dramatically.

    system.time(for (i in 1:1000) set(DT, i, "V1", i))
       user  system elapsed
      0.016   0.000   0.018
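As a further illustration of `set` inside a loop, here is one common pattern: replacing missing values column by column. The data and the choice of 0 as the fill value are assumptions for the example, not part of the original answer.

```r
library(data.table)

# Illustrative table with some missing values
DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# Loop over columns; i must be row numbers, so compute them with which()
for (j in seq_along(DT)) {
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

DT$a  # 1 0 3
DT$b  # 0 5 6
```

Here `j` is given as a column number; as in the timing above, a column name such as "V1" works as well.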