使用二进制搜索通过范围子集设置数据表 [英] Subsetting a data.table by range making use of binary search

查看:97
本文介绍了使用二进制搜索通过范围子集设置数据表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何使用数字范围来子集化data.table,以便使用二进制搜索?

How do you go about subsetting a data.table by a numeric range, with the intention of using binary search?

例如:

require(data.table)
set.seed(1)

x<-runif(10000000,min=0,max=10)
y<-runif(10000000,min=0,max=10)

DF<-data.frame(x,y)
DT<-data.table(x,y)

system.time(DFsub<-DF[DF$x>5 & DF$y<7,])
# user  system elapsed 
# 1.529   0.250   1.821 

#subset DT
system.time(DTsub<-DT[x>5 & y<7])
# user  system elapsed 
#0.716   0.119   0.841 

上面没有使用键(向量扫描),加速不是那么戏剧。使用二进制搜索对数据表的数值范围进行子集化的语法是什么?
我在文档中找不到一个很好的例子;如果有人可以提供一个使用玩具data.table的示例将是有帮助的。

The above doesn't use a key (vector scan), and the speed-up isn't so dramatic. What is the syntax for subsetting a numeric range of a data.table, making use of binary search? I can't find a good example in the documentation; it would be helpful if someone could provide an example using the toy data.table above.

编辑:这个问题是类似的,但仍然没有演示如何子集a范围:
data.table:向量扫描v使用数字列的二进制搜索 - 超慢setkey

This question is similar, but still doesn't demonstrate how to subset by a range: data.table: vector scan v binary search with numeric columns - super-slow setkey

推荐答案

有趣的问题。首先让我们看看示例数据:

Interesting question. First let's look at the example data :

> print(DT)
                     x        y
       1: 2.607703e-07 5.748127
       2: 8.894131e-07 5.233994
       3: 1.098961e-06 9.834267
       4: 1.548324e-06 2.016585
       5: 1.569279e-06 7.957730
      ---                      
 9999996: 9.999996e+00 9.977782
 9999997: 9.999998e+00 2.666575
 9999998: 9.999999e+00 6.869967
 9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(DT$x)
[1] 10000000
> length(unique(DT$x))
[1] 9988478
> length(DT$y)
[1] 10000000
> length(unique(DT$y))
[1] 9988225
> DT[,.N,by=x][,table(N)]
N
      1       2       3 
9976965   11504       9 
> DT[,.N,by="x,y"][,table(N)]
N
       1 
10000000 
> 

因此,第一列中有近1000万个唯一浮点值:和3行,但主要是1行组。一旦第二列包括,有大小1行的1000万个唯一组。这是一个相当困难的问题,因为 data.table 设计更多用于分组数据;例如(id,date),(id1,id2,date,time)等。

So there are almost 10 million unique floating point values in the first column: a few groups of size 2 and 3 rows but mostly 1 row groups. Once the second column is including, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more for grouped data in mind; e.g, (id, date), (id1, id2, date, time) etc.

但是, data.table setkey 在键中支持浮点数据,所以让我们一起去吧。

However, data.table and setkey do support floating point data in keys, so let's give it a go.

netbook:

> system.time(setkey(DT,x,y))
   user  system elapsed 
  7.097   0.520   7.650 

> system.time(DT[x>5 & y<7])
   user  system elapsed 
  2.820   0.292   3.122 

所以矢量扫描方法比设置键快(我们甚至还没有使用键)。给定数据是浮点数和几乎唯一,那么这不是太惊讶,但我认为这是一个相当快的时间为 setkey 排序1000万完全随机和几乎唯一的双。

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given the data is floating point and almost unique then this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

比较基础例如,只是排序 x 甚至 y 以及:

Compare to base for example, just sorting x not even y as well :

> system.time(base::order(x))
   user  system elapsed 
 72.445   0.292  73.072 

假设此数据代表您的真实数据,并且您不想只执行一次而是多次,所以愿意支付 setkey ,第一步很清楚:

Assuming this data is representative of your real data, and you don't want to do this just once but several times, so are willing to pay the price of setkey, the first step is pretty clear :

system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
   user  system elapsed 
  0.004   0.000   0.003 
> w
[1] 4999902

下一步如 DT [(w + 1):nrow(DT)] 是丑陋的并复制。我不能想象一个体面的方式使用键从这里做 y< 7 部分以及。在其他示例数据中,我们执行类似 DT [。(unique(x),7),which = TRUE,roll = TRUE] ,但在这种情况下,

But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE] but in this case the data is so unique and floating point that's going to be slow.

理想情况下,此任务需要 range join(FR#203)实现。此示例中的语法可能是:

Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be :

DT[.( c(5,Inf), c(-Inf,7) )]

或更容易, DT [x> 5& y< 7] 可以被优化以在引擎盖下进行。允许在i中加入相应x列的两列范围可能非常有用,并且已经出现了几次。

or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.

v1.9.2中的加速需要在我们可以移动到这样的事情之前先做。如果您在v1.8.10中对此数据尝试 setkey ,您会发现v1.9.2的速度明显更快。

The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.

另请参阅:

如何在条件上自加入data.table

删除data.table中的范围

这篇关于使用二进制搜索通过范围子集设置数据表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆