Fastest method to replace data values conditionally in data.table (speed comparison)


Problem description



Why does the second method become slower as the data.table size increases:

library(data.table)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

1:

DF1=DF2=DF

system.time(DF[y==6,"y"]<-10)
user  system elapsed 
2.793   0.699   3.497 

2:

system.time(DF1$y[DF1$y==6]<-10)
user  system elapsed 
6.525   1.555   8.107 

3:

system.time(DF2[y==6, y := 10]) # slowest!
user  system elapsed 
7.925   0.626   8.569 

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

Is there any faster way to do this?

Solution

In your last case, this is a consequence of the auto-indexing feature in data.table, available since v1.9.4. Read on for the full picture :-).

When you do DT[col == .] or DT[col %in% .], an index is generated automatically on your first run. The index is simply the order of the column you specify. The computation of the index is quite fast (it uses counting sort / true radix sort).
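
Here is a minimal sketch of that behaviour on a small table, assuming a data.table version that exports indices() (in the 1.9.4 era the index is only visible as an attribute on the table):

# Sketch only: watch the auto-index appear after the first == subset.
# A small table is used so this runs quickly.
library(data.table)
dt = data.table(y = sample(c(1L, 3L, 6L), 1e6, TRUE), v = 1L)

indices(dt)   # NULL -- no index yet
dt[y == 6L]   # first run on y == . builds an index on 'y'
indices(dt)   # "y" -- later subsets on y can reuse it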

The table has 120 million rows, and computing the index takes roughly:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(data.table:::forderv(DF, "y"))
#   3.923   0.736   4.712 

Side note: column y doesn't really need to be double (ordering doubles takes longer). If we convert it to integer type:

DF[, y := as.integer(y)]
system.time(data.table:::forderv(DF, "y"))
#    user  system elapsed 
#   0.569   0.140   0.717 

The advantage is that any subsequent subsets on that column using == or %in% will be blazing fast (Slides, R script, video of Matt's presentation). For example:

# clean session, copy/paste code from above to create DF
system.time(DF[y==6, y := 10])
#    user  system elapsed 
#   4.750   1.121   5.932 

system.time(DF[y==6, y := 10])
#    user  system elapsed 
#   4.002   0.907   4.969 

Oh wait a minute.. it isn't fast. But.. indexing..?!? We're replacing the same column every time with a new value. That changes the order of that column (thereby removing the index). Let's keep subsetting on y, but modify v instead:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   4.653   1.071   5.765 
system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   0.685   0.213   0.910 

options(datatable.verbose=TRUE)
system.time(DF[y==6, v := 10L])
# Using existing index 'y'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: v 
# Assigning to 40000059 row subset of 120000000 rows
#    user  system elapsed 
#   0.683   0.221   0.914 

You can see that the subset using the existing index (a binary search) takes 0 seconds. Also check ?set2key().
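
To make the index-invalidation point concrete, here is another minimal sketch (again assuming indices() is available): assigning to a non-indexed column keeps the index, while assigning to the indexed column itself drops it, which is why case 3 in the question never speeds up:

# Sketch only: := on a non-indexed column keeps the index; := on the
# indexed column changes its order, so the index is dropped and the next
# y == . subset has to vector scan / rebuild it.
library(data.table)
dt = data.table(y = sample(c(1L, 3L, 6L), 1e6, TRUE), v = 1L)

dt[y == 6L, v := 10L]   # first run builds the index on 'y'
indices(dt)             # "y"

dt[y == 6L, y := 10L]   # now write to 'y' itself
indices(dt)             # NULL -- the index on 'y' is dropped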

If you're not going to do repeated subsetting, or if, as in your case, you're subsetting and modifying the same column, then it makes sense to disable the feature with options(datatable.auto.index = FALSE), filed #1264:

# clean session
require(data.table)
options(datatable.auto.index = FALSE) # disable auto indexing
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   1.067   0.274   1.367 
system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   1.100   0.314   1.443 

The difference isn't much here. The time for the vector scan alone is system.time(DF$y == 6) = 0.448s.

To sum up, in your case a vector scan makes more sense. But in general, the idea is that it's better to pay the penalty once and get fast results on future subsets of that column, rather than vector scanning every single time.
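
As a sketch of that "pay once" idea using an explicit key instead of the auto-index (standard data.table usage, not timings from above): setkey() sorts the table once, and every later subset-assign on y is a binary search:

# Sketch only: pay the sorting cost once with setkey(), then repeated
# subset-assigns on y use binary search instead of a fresh vector scan.
library(data.table)
dt = data.table(y = sample(c(1L, 3L, 6L), 1e6, TRUE), v = 1L)

setkey(dt, y)          # one-off cost: physically sorts the table by y
dt[.(6L), v := 10L]    # keyed join on y == 6, assign to v
dt[.(3L), v := 20L]    # further keyed subsets stay fast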

The auto-indexing feature is relatively new and will be extended over time, and probably optimised (there may be places we've not looked at yet). While answering this question, I realised that we don't show the time taken to compute the sort order (using fsort()); I suspect the time spent there is the reason the timings are quite close (filed #1265).


As to your second case being slow, I'm not quite sure why. I suspect it might be due to unnecessary copies on R's part. What version of R are you using? In the future, always post your sessionInfo() output.
