我们可以在带有OR选择查询的data.table中进行二分查找 [英] Can we do binary search in data.table with OR select queries

查看：123 发布时间：2017/3/12 11:23:27 r data.table

本文介绍了我们可以在带有OR选择查询的data.table中进行二分查找的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在上一个问题之后使用 data.table

  DT = data.table（x = 1e7，T），y = sample（1：25,1e7，T），rnorm（1e7））
 setkey（DT，x，y）

我们可以使用二进制搜索找到

  DT [x == 'a'| y == 25]

请记住， DT [J（'a' 25）] == DT [x =='a'& y == 25]

解决方案

是：

一个二叉serach，我们需要适当的索引。

  indx < -  rbind（DT [y == 25，list（y = 25），by = x] [indx，setdiff（indx，unique（DT [，key（x）]） （DT），with = FALSE]））
 indx 
 
 DT [。（indx）]

基准化：

这给予我们超过<强>超过矢量化serach。

 相同（setkey（DT [。（indx）]），setkey（DT [x ==a| y = = 25]））
＃[1] TRUE 
 
 
库（microbenchmark）
 microbenchmark（UsingIndx = DT [。（indx）]，UsingVecSearch = DT [ x ==a| y == 25]，times = 100）
 
单位：毫秒
 expr min lq median uq max 
 1使用Indx 34.27562 41.70119 48.13215 49.29752 231.1669 
 2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842 
  
 
 
 
 
 
 < 
 
 为了方便起见，我们可以将代码的创建索引部分包装到一个好的函数
中，这样我们可以调用它在单行。例如：
  DT [。（OrIndx（a，25，DT））] 
  
其中 OrIndx（）定义如下：
  OrIndx < -  function（xval，yval，DT）{
＃TODO：允许任意列和列名称
 if （$ is.data.table（DT））
 stop（DT不是数据表）
 
＃创建所有适当的组合
 indx < -  rbind DT [y == yval，list（y = yval），by = x]，DT [。（xval），list（x = xval），by = y]，use.names = TRUE）
 
＃取出indx中实际不存在于DT中的任何组合并返回
 return（setdiff（indx，setdiff（indx，unique（DT [，key（DT），with = FALSE]）））） 
} 
  
 
 
 
说明：
 
 
 这里的想法是，执行或操作需要某种形式的组合。
 
在标准向量搜索中，此组合是每个单独向量搜索的结果。 
 
 
  data.table允许使用
  DT [。（c（cdf，tmb），c（25,3））] 
  
 
 $ b b 
因此，问题的自然解决方案是使用：
  DT [。值x>，a），c（25，  
唯一的问题是回收不会正确排列。
 
这将是理想的选择像
  DT [。（list（c（unique（x），y = 25），c（x =a，y = unique（y）））] 
  
但是据我所知，尚未实现（尚未实现）
 
因此，我们可以采取适当的组合。
 
上面的函数 OrIndx 确实如此。（快速&脏，有更有效的方式创建索引）
 
 
 
 
 
 使用其他基准更新
 
 
 根据@Aruns建议，我们包括
  rbind（DT [J（a）]，DT [J（setdiff唯一的（x），a），25）]）
 rbindlist（list（DT [ ]））
  
在1e6和1e7行测试：
 
 $ b b 
  ##使用1百万行
>微基准（Using_Indx = DT [。（indx）]，Using_RbindList = rbindlist（list（DT [J（a）]，DT [J（setdiff（unique（x），a）， Use_Rbind = rbind（DT [J（a）]，DT [J（setdiff（unique（x），a），25）]），Using_VecSearch = DT [x ==a| y == 25 ]，times = 70L）
单位：毫秒
 expr min lq median uq max 
 1 Using_Indx 4.865089 5.755615 5.813938 5.957352 6.880743 
 2 Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670 
 3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338 
 4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946 
 
 
 ##使用10 Milliion行
单位：毫秒
 expr min lq median uq max 
 1 Using_Indx 33.71108 47.5402 48.7574 50.75285 122.0950 
 2 Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907 
 3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639 
 4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 773.3997 
 
＃确保所有相同的结果：
> （setkey（DT [。（indx）]），setkey（DT [x ==a| y == 25]））
 [1] TRUE 
>相同的（setkey（DT [。（indx）]），setkey（rbind（DT [J（a）]，DT [J（setdiff（unique（x），a），25） b $ b [1] TRUE 
  
请注意，对于SMALL tabbles（小于 15K 行），向量搜索更快（对于真正小的表，大约两倍快）
  ##使用100行
>微基准（Using_Indx = DT [。（indx）]，Using_RbindList = rbindlist（list（DT [J（a）]，DT [J（setdiff（unique（x），a）， Using_rbind = rbind（DT [J（a）]，DT [J（setdiff（unique（x），a），25）]），Using_VecSearch = DT [x ==a| y == 25 ]，times = 150L）
单位：微秒
 expr min lq median uq max 
 1 Using_Indx 884.819 901.854 917.3715 933.642 9740.046 
 2 Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637 
 3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146 
 4 Using_VecSearch 386.867 401.328 407.5730 420.647 2908.090 
  
约10,000行，此时我们开始看到增益：
  ## 10,000行
单位：微秒
 expr min lq median uq max 
 1 Using_Indx 891.374 921.784 931.6585 956.737 3780.971 
 4 Using_VecSearch 796.316 815.965 824.1480 845.151 2531.314 
 
 ## 15,000行
单位：微秒
 expr min lq median uq max 
 1 Using_Indx 913.963 939.198 954.518 986.609 2900.174 
 4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470 
 
 ## 30,000行
单位：微秒
 expr min lq median uq max 
 1 Using_Indx 964.402 995.883 1018.535 1045.908 5999.390 
 4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976 8868.470 
 
 ## 100,000行
单位：milliseconds 
 expr min lq median uq max 
 1 Using_Indx 1.142318 1.181023 1.198611 1.268417 3.611945 
 4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510 
 
 ## 10,000,000行（仅运行30 
单位：毫秒
 expr min lq median uq max 
 1 Using_Indx 33.95004 42.24995 48.90363 50.15424 177.0991 
 2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465 
  
 
Following the previous question using data.table
DT = data.table(x=sample(letters,1e7,T),y=sample(1:25,1e7,T),rnorm(1e7))
setkey(DT,x,y)
Can we use binary search to find 
DT[x=='a' | y==25]
Remember that DT[J('a',25)] == DT[x=='a' & y==25]
 解决方案 
Yes:

In order to do a binary serach, we need the appropriate indices. 
  indx <- rbind(DT[y==25, list(y=25), by=x], DT[.("a"), list(x="a"), by=y], use.names=TRUE)
  indx <- setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE])))
  indx

  DT[.(indx)]




Benchmarking:

This gives us more than a 10x improvement over vectorized serach. 
  identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
  # [1] TRUE


  library(microbenchmark)
  microbenchmark(UsingIndx = DT[.(indx)],  UsingVecSearch = DT[x=="a" | y == 25], times=100 )

  Unit: milliseconds
              expr       min        lq    median        uq      max
  1      UsingIndx  34.27562  41.70119  48.13215  49.29752 231.1669
  2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842






For convenience, we can wrap the "creating the index" portion of the code into a nice function, 
so that we can then call it in a single line.  For example: 
DT[.(OrIndx("a", 25, DT))]
Where OrIndx() is defined as follows:  
OrIndx <- function(xval, yval, DT)  {
  # TODO: Allow for arbitrary columns and column names
  if(!is.data.table(DT))
    stop("DT is not a data.table")

  # create all appropriate combinations
  indx <- rbind(DT[y==yval, list(y=yval), by=x], DT[.(xval), list(x=xval), by=y], use.names=TRUE)

  # take out any combinations in indx that are not actually present in DT and return 
  return( setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))) )
}


Explanation:

The idea here is that performing an "or" serach requires some form of combination.

In a standard vector search, this combination is of the results of each individual vector serach. 

data.table offers some great speed improvements by allowing seraches such as 
 DT[.(c("cdf", "tmb"), c(25, 3))] 
Therefore, a natural solution to the question would be to use:
 DT[.(c(<all values of x>, "a"), c(25, <all values of y>))] 
The only problem is that the recycling would not line up properly.

It would be ideal to have an option like 
 DT[.(  list( c(unique(x), y=25), c(x="a", y=unique(y) )  )]
But as far as I can tell that has not been implemented (yet!)

So instead, we can take appropriate combinations.

The function OrIndx above does exactly that.  (it s quick & dirty and there are more efficient ways of creating the index)



Update with additional benchmarks

As per @Aruns suggestion, we include 
rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])
rbindlist(list( DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)] ))
Tested on 1e6 and 1e7 rows:
  ## Using 1 Million rows
  > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_Rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]),  Using_VecSearch = DT[x=="a" | y == 25], times=70L )
  Unit: milliseconds
               expr       min        lq    median        uq        max
  1      Using_Indx  4.865089  5.755615  5.813938  5.957352   6.880743
  2     Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670
  3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338
  4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946


  ## Using 10 Milliion rows
  Unit: milliseconds
               expr       min       lq   median        uq      max
  1      Using_Indx  33.71108  47.5402  48.7574  50.75285 122.0950
  2     Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907
  3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639
  4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 765.3997

  # Making sure all the same results: 
  > identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
  [1] TRUE
  > identical(setkey(DT[.(indx)]), setkey(rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])))
  [1] TRUE
Note that for SMALL tabbles (less than 15K rows), vector search is faster (for really small tables, about twice as fast)
  ## Using  100 Rows
  > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]),  Using_VecSearch = DT[x=="a" | y == 25], times=150L )
  Unit: microseconds
               expr      min       lq    median       uq      max
  1      Using_Indx  884.819  901.854  917.3715  933.642 9740.046
  2     Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637
  3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146
  4 Using_VecSearch  386.867  401.328  407.5730  420.647 2908.090
This pattern holds until about 10,000 rows, at which point we start to see the gains: 
  ## 10,000 Rows
  Unit: microseconds
               expr      min       lq    median       uq      max
  1      Using_Indx  891.374  921.784  931.6585  956.737 3780.971
  4 Using_VecSearch  796.316  815.965  824.1480  845.151 2531.314

  ## 15,000 Rows
  Unit: microseconds
               expr      min       lq   median       uq      max
  1      Using_Indx  913.963  939.198  954.518  986.609 2900.174
  4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470

  ## 30,000 Rows
  Unit: microseconds
               expr      min       lq   median       uq       max
  1      Using_Indx  964.402  995.883 1018.535 1045.908  5999.390
  4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976  8868.470

  ## 100,000 Rows
  Unit: milliseconds
               expr      min       lq   median       uq       max
  1      Using_Indx 1.142318 1.181023 1.198611 1.268417  3.611945
  4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510

  ## 10,000,000 Rows   (only ran 30 reps for this one)
  Unit: milliseconds
               expr       min        lq    median        uq      max
  1      Using_Indx  33.95004  42.24995  48.90363  50.15424 177.0991
  2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465


                        
这篇关于我们可以在带有OR选择查询的data.table中进行二分查找的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

我们可以在带有OR选择查询的data.table中进行二分查找 [英] Can we do binary search in data.table with OR select queries

问题描述

说明：

使用其他基准更新

Explanation:

Update with additional benchmarks

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

我们可以在带有OR选择查询的data.table中进行二分查找 [英] Can we do binary search in data.table with OR select queries

问题描述

说明：

使用其他基准更新

Explanation:

Update with additional benchmarks

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭