data.table: vector scan v binary search with numeric columns - super-slow setkey

Question

I am trying to find the quickest way to subset a large dataset by several numeric columns. As promised by data.table, binary search is much quicker than vector scanning. Binary search, however, requires setkey to be run beforehand. As you can see in this code, that takes an exceptionally long time! Once you take that time into account, vector scanning is much faster:

library(data.table)

set.seed(1)
n = 10^7
nums <- round(runif(n, 0, 10000))
DT = data.table(s = sample(nums, n), exp = sample(nums, n),
                init = sample(nums, n), contval = sample(nums, n))

# Take the lookup values from the middle row so that at least one row matches
this_s = DT[0.5*n, s]
this_exp = DT[0.5*n, exp]
this_init = DT[0.5*n, init]

# Vector scan
system.time(ans1 <- DT[s == this_s & exp == this_exp & init == this_init, 4, with = FALSE])
#   user  system elapsed 
#   0.65    0.01    0.67 

# Set the key (required before binary search)
system.time(setkey(DT, s, exp, init))
#   user  system elapsed 
#  41.56    0.03   41.59 

# Binary search on the keyed table
system.time(ans2 <- DT[J(this_s, this_exp, this_init), 4, with = FALSE])
#   user  system elapsed 
#    0       0       0 

identical(ans1, ans2)
# [1] TRUE

Am I doing something wrong? I've read through the data.table FAQs etc. Any help would be greatly appreciated.

Recommended answer

The line:

nums <- round(runif(n,0,10000))

leaves nums as type numeric, not integer. That makes a big difference. The data.table FAQs and introduction are geared towards integer and character columns; you won't see setkey being this slow on those types. For example:

nums <- as.integer(round(runif(n,0,10000)))
...
setkey(DT,s,exp,init)  # much faster now
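
The type difference is easy to check before setting a key. The snippet below is a minimal sketch in base R on a small vector; `n_small`, `nums_dbl` and `nums_int` are names made up for this illustration and do not appear in the original code.

```r
set.seed(1)
n_small <- 1000   # hypothetical small size, for illustration only

# round() returns a double even though every value is a whole number
nums_dbl <- round(runif(n_small, 0, 10000))
typeof(nums_dbl)   # "double"

# as.integer() changes the storage type; the values stay the same
nums_int <- as.integer(nums_dbl)
typeof(nums_int)   # "integer"

# Confirm that the conversion did not alter any value
all(nums_dbl == nums_int)   # TRUE
```

On an existing table, `sapply(DT, typeof)` reports the storage type of every column in the same way.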

Two further points, though:

First, ordering/sorting operations are much faster in the current development version of data.table, v1.8.11. @jihoward is right that sorting on numeric columns is a much more time-consuming operation. Still, it is about 5-8x faster in 1.8.11 (because of a 6-pass radix order implementation; check this post). Comparing the time taken by setkey in 1.8.10 and 1.8.11:

# v 1.8.11
system.time(setkey(DT,s,exp,init))
#    user  system elapsed 
#   8.358   0.375   8.844 

# v 1.8.10
system.time(setkey(DT,s,exp,init))
#   user  system elapsed 
# 66.609   0.489  75.216 

That is an 8.5x improvement on my system, so my guess is that it would take about 4.9 seconds on yours.
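
The arithmetic behind that estimate can be reproduced from the elapsed times quoted in this thread. This is only a rough sketch: it assumes the speedup measured on one machine carries over unchanged to the asker's machine, which is a guess rather than a measurement. The variable names are made up for the illustration.

```r
# Elapsed setkey() times reported for the two versions, in seconds
elapsed_1_8_10 <- 75.216
elapsed_1_8_11 <- 8.844

# Speedup of 1.8.11 over 1.8.10 on the answerer's system
speedup <- elapsed_1_8_10 / elapsed_1_8_11
round(speedup, 1)   # 8.5

# Elapsed setkey() time from the question, measured on the asker's machine
asker_elapsed <- 41.59

# Assumption: the same speedup applies on the asker's machine
estimated_elapsed <- asker_elapsed / speedup
round(estimated_elapsed, 1)   # 4.9
```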

Second, as @Roland mentions, if your objective is to perform a couple of subsets and that is ALL you're going to do, then of course it doesn't make sense to call setkey, because it has to compute the sort order over the key columns and then reorder the entire data.table (by reference, so the memory footprint is minimal; check this post for more on setkey).
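
A back-of-the-envelope break-even count makes the trade-off concrete. The sketch below uses the timings from the question and assumes that a keyed lookup costs essentially nothing, as the reported elapsed time of 0 suggests; real lookups have a small non-zero cost, so treat the result as a lower bound. All variable names are made up for the illustration.

```r
scan_elapsed   <- 0.67    # one vector-scan subset, seconds (from the question)
setkey_elapsed <- 41.59   # one-off setkey() cost, seconds (from the question)

# Assumption: each binary-search lookup is treated as free.
# Number of subsets needed before setkey() plus lookups beats repeated scans
break_even <- ceiling(setkey_elapsed / scan_elapsed)
break_even   # 63

# Same calculation using the estimated 4.9 s setkey() time for v1.8.11
break_even_fast <- ceiling(4.9 / scan_elapsed)
break_even_fast   # 8
```

With only a couple of subsets, the total scan time stays well below the cost of setting the key once.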
