使用二分搜索按范围对 data.table 进行子集化 [英] Subsetting a data.table by range making use of binary search

查看:17
本文介绍了使用二分搜索按范围对 data.table 进行子集化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何通过数字范围对 data.table 进行子集化,以使用二分搜索?

How do you go about subsetting a data.table by a numeric range, with the intention of using binary search?

例如:

require(data.table)
set.seed(1)

x<-runif(10000000,min=0,max=10)
y<-runif(10000000,min=0,max=10)

DF<-data.frame(x,y)
DT<-data.table(x,y)

system.time(DFsub<-DF[DF$x>5 & DF$y<7,])
# user  system elapsed 
# 1.529   0.250   1.821 

#subset DT
system.time(DTsub<-DT[x>5 & y<7])
# user  system elapsed 
#0.716   0.119   0.841 

上面没有使用键(矢量扫描),并且加速不是那么显着.使用二进制搜索对 data.table 的数字范围进行子集化的语法是什么?我在文档中找不到一个很好的例子;如果有人可以使用上面的玩具 data.table 提供示例,那将会很有帮助.

The above doesn't use a key (vector scan), and the speed-up isn't so dramatic. What is the syntax for subsetting a numeric range of a data.table, making use of binary search? I can't find a good example in the documentation; it would be helpful if someone could provide an example using the toy data.table above.

这个问题很相似,但仍然没有演示如何按范围进行子集化:数据.表:向量扫描 v 带有数字列的二进制搜索 - 超慢 setkey

This question is similar, but still doesn't demonstrate how to subset by a range: data.table: vector scan v binary search with numeric columns - super-slow setkey

推荐答案

有趣的问题.首先让我们看一下示例数据:

Interesting question. First let's look at the example data :

> print(DT)
                     x        y
       1: 2.607703e-07 5.748127
       2: 8.894131e-07 5.233994
       3: 1.098961e-06 9.834267
       4: 1.548324e-06 2.016585
       5: 1.569279e-06 7.957730
      ---                      
 9999996: 9.999996e+00 9.977782
 9999997: 9.999998e+00 2.666575
 9999998: 9.999999e+00 6.869967
 9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(DT$x)
[1] 10000000
> length(unique(DT$x))
[1] 9988478
> length(DT$y)
[1] 10000000
> length(unique(DT$y))
[1] 9988225
> DT[,.N,by=x][,table(N)]
N
      1       2       3 
9976965   11504       9 
> DT[,.N,by="x,y"][,table(N)]
N
       1 
10000000 
> 

因此,第一列中有近 1000 万个唯一浮点值:几组大小为 2 行和 3 行,但大多数为 1 行组.一旦包含第二列,就有 1000 万个大小为 1 行的唯一组.这是一个相当棘手的问题,因为 data.table 更多地是为分组数据而设计的;例如,(id,日期),(id1,id2,日期,时间)等

So there are almost 10 million unique floating point values in the first column: a few groups of size 2 and 3 rows but mostly 1 row groups. Once the second column is including, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more for grouped data in mind; e.g, (id, date), (id1, id2, date, time) etc.

但是,data.tablesetkey 确实支持键中的浮点数据,所以让我们试一试吧.

However, data.table and setkey do support floating point data in keys, so let's give it a go.

在我的慢速上网本上:

> system.time(setkey(DT,x,y))
   user  system elapsed 
  7.097   0.520   7.650 

> system.time(DT[x>5 & y<7])
   user  system elapsed 
  2.820   0.292   3.122 

所以矢量扫描的方法比设置密钥要快(而且我们还没有使用密钥).鉴于数据是浮点数并且几乎是唯一的,所以这并不太令人惊讶,但我认为对于 setkey 来说这是一个相当快的时间来对 1000 万个完全随机且几乎唯一的双精度数进行排序.

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given the data is floating point and almost unique then this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

例如,与 base 相比,仅对 x 排序,甚至对 y 也进行排序:

Compare to base for example, just sorting x not even y as well :

> system.time(base::order(x))
   user  system elapsed 
 72.445   0.292  73.072 

假设这个数据代表你的真实数据,而且你不想只做一次而是多次,所以愿意付出setkey的代价,第一步很漂亮明确:

Assuming this data is representative of your real data, and you don't want to do this just once but several times, so are willing to pay the price of setkey, the first step is pretty clear :

system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
   user  system elapsed 
  0.004   0.000   0.003 
> w
[1] 4999902

但是在这里我们被困住了.像 DT[(w+1):nrow(DT)] 这样的下一步是丑陋的和复制的.我想不出一种体面的方式来使用这里的密钥来执行 y<7 部分.在其他示例数据中,我们执行 DT[.(unique(x), 7), which=TRUE, roll=TRUE] 之类的操作,但在这种情况下,数据是如此独特且浮点数慢慢来.

But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE] but in this case the data is so unique and floating point that's going to be slow.

理想情况下,此任务需要 范围连接 (FR#203) 实现.此示例中的语法可能是:

Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be :

DT[.( c(5,Inf), c(-Inf,7) )]

或者为了更简单,DT[x>5 &y<7] 可以在后台进行优化以做到这一点.允许 i 中的两列范围连接到相应的 x 列可能非常有用,并且已经出现了好几次.

or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.

需要先完成 v1.9.2 中的加速,然后我们才能继续进行类似的操作.如果您在 v1.8.10 中对此数据尝试 setkey,您会发现 v1.9.2 明显更快.

The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.

另见:

如何自行加入 data.table条件

删除data.table中的范围

这篇关于使用二分搜索按范围对 data.table 进行子集化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆