data.table: vector scan v binary search with numeric columns - super-slow setkey
Question
I am trying to find the quickest way to subset a large dataset by several numeric columns. As data.table promises, binary search is much quicker than vector scanning. Binary search, however, requires setkey to be performed beforehand, and as this code shows, that takes an exceptionally long time. Once you take that time into account, vector scanning is much faster overall:
library(data.table)
set.seed(1)
n=10^7
nums <- round(runif(n,0,10000))
DT = data.table(s=sample(nums,n), exp=sample(nums,n),
init=sample(nums,n), contval=sample(nums,n))
this_s = DT[0.5*n,s]
this_exp = DT[0.5*n,exp]
this_init = DT[0.5*n,init]
system.time(ans1<-DT[s==this_s&exp==this_exp&init==this_init,4,with=FALSE])
# user system elapsed
# 0.65 0.01 0.67
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 41.56 0.03 41.59
system.time(ans2<-DT[J(this_s,this_exp,this_init),4,with=FALSE])
# user system elapsed
# 0 0 0
identical(ans1,ans2)
# [1] TRUE
Am I doing something wrong? I've read through the data.table FAQs etc. Any help would be greatly appreciated.
Many thanks.
Recommended answer
The line:
nums <- round(runif(n,0,10000))
leaves nums as type numeric, not integer. That makes a big difference. The data.table FAQs and introduction are geared towards integer and character columns; you won't see setkey being as slow on those types. For example:
nums <- as.integer(round(runif(n,0,10000)))
...
setkey(DT,s,exp,init) # much faster now
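As a quick check of the type point above, here is a minimal base-R sketch. The names x_num and x_int are made up for illustration; it only shows that round() keeps the values as doubles while as.integer() changes the storage type without changing the values.

```r
set.seed(1)

# round() returns whole numbers, but they are still stored as doubles
x_num <- round(runif(5, 0, 10000))

# as.integer() converts the storage type to integer
x_int <- as.integer(x_num)

typeof(x_num)        # "double"
typeof(x_int)        # "integer"
all(x_num == x_int)  # TRUE: same values, different storage type
```

Running sapply(DT, typeof) on the table before calling setkey is one way to spot numeric key columns.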
Two further points, though.
First, the ordering/sorting operations are much faster in the current development version of data.table, v1.8.11. @jihoward is right that sorting on numeric columns is a much more time-consuming operation. Still, it is about 5-8x faster in 1.8.11 (because of a 6-pass radix order implementation; check this post). Comparing the time taken for the setkey operation between 1.8.10 and 1.8.11:
# v 1.8.11
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 8.358 0.375 8.844
# v 1.8.10
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 66.609 0.489 75.216
That is an 8.5x improvement on my system (75.216 s down to 8.844 s elapsed). So my guess is that this would take about 4.9 seconds on yours (41.59 s / 8.5).
Second, as @Roland mentions, if your objective is to perform a couple of subsets and that is ALL you're going to do, then of course it doesn't make sense to do a setkey, as it has to find the ordering of the key columns and then reorder the entire data.table (by reference, so the memory footprint is minimal; check this post for more on setkey).
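To put rough numbers on that trade-off, here is a back-of-the-envelope sketch using the elapsed times reported in the question. It assumes each keyed lookup costs effectively nothing (it was reported as 0) and that every vector scan costs the same 0.67 s; the variable names are made up for illustration.

```r
# Elapsed times taken from the question (seconds)
scan_time   <- 0.67   # one vector-scan subset
setkey_time <- 41.59  # setkey on the numeric columns (v1.8.10-era timing)

# Number of subsets at which the one-off setkey cost is paid back,
# assuming keyed lookups are effectively free
breakeven_old <- ceiling(setkey_time / scan_time)

# Same estimate using the ~4.9 s guess for v1.8.11 from the answer
breakeven_new <- ceiling(4.9 / scan_time)

breakeven_old  # 63
breakeven_new  # 8
```

Under these assumptions, setkey pays for itself only after a few dozen subsets on the older timing, and after fewer than ten on the estimated 1.8.11 timing.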