我们可以在带有OR选择查询的data.table中进行二分查找 [英] Can we do binary search in data.table with OR select queries
问题描述
在上一个问题之后使用 data.table
DT = data.table(x = 1e7,T),y = sample(1:25,1e7,T),rnorm(1e7))
setkey(DT,x,y)
我们可以使用二进制搜索找到
DT [x == 'a'| y == 25]
请记住, DT [J('a' 25)] == DT [x =='a'& y == 25]
是:
一个二叉serach,我们需要适当的索引。
indx < - rbind(DT [y == 25,list(y = 25),by = x] [indx,setdiff(indx,unique(DT [,key(x)]) (DT),with = FALSE]))
indx
DT [。(indx)]
基准化:
这给予我们超过<强>超过矢量化serach。
相同(setkey(DT [。(indx)]),setkey(DT [x ==a| y = = 25]))
#[1] TRUE
库(microbenchmark)
microbenchmark(UsingIndx = DT [。(indx)],UsingVecSearch = DT [ x ==a| y == 25],times = 100)
单位:毫秒
expr min lq median uq max
1使用Indx 34.27562 41.70119 48.13215 49.29752 231.1669
2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842
<
为了方便起见,我们可以将代码的创建索引部分包装到一个好的函数
中,这样我们可以调用它在单行。例如:DT [。(OrIndx(a,25,DT))]
其中
OrIndx()
定义如下:OrIndx < - function(xval,yval,DT){
#TODO:允许任意列和列名称
if ($ is.data.table(DT))
stop(DT不是数据表)
#创建所有适当的组合
indx < - rbind DT [y == yval,list(y = yval),by = x],DT [。(xval),list(x = xval),by = y],use.names = TRUE)
#取出indx中实际不存在于DT中的任何组合并返回
return(setdiff(indx,setdiff(indx,unique(DT [,key(DT),with = FALSE]))))
}
说明:
这里的想法是,执行或操作需要某种形式的组合。
在标准向量搜索中,此组合是每个单独向量搜索的结果。
data.table允许使用
DT [。(c(cdf,tmb),c(25,3))]
$ b b因此,问题的自然解决方案是使用:
DT [。值x>,a),c(25,
唯一的问题是回收不会正确排列。
这将是理想的选择像DT [。(list(c(unique(x),y = 25),c(x =a,y = unique(y)))]
但是据我所知,尚未实现(尚未实现)
因此,我们可以采取适当的组合。
上面的函数OrIndx
确实如此。(快速&脏,有更有效的方式创建索引)
使用其他基准更新
根据@Aruns建议,我们包括
rbind(DT [J(a)],DT [J(setdiff唯一的(x),a),25)])
rbindlist(list(DT [ ]))
在1e6和1e7行测试:
$ b b##使用1百万行
>微基准(Using_Indx = DT [。(indx)],Using_RbindList = rbindlist(list(DT [J(a)],DT [J(setdiff(unique(x),a), Use_Rbind = rbind(DT [J(a)],DT [J(setdiff(unique(x),a),25)]),Using_VecSearch = DT [x ==a| y == 25 ],times = 70L)
单位:毫秒
expr min lq median uq max
1 Using_Indx 4.865089 5.755615 5.813938 5.957352 6.880743
2 Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670
3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338
4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946
##使用10 Milliion行
单位:毫秒
expr min lq median uq max
1 Using_Indx 33.71108 47.5402 48.7574 50.75285 122.0950
2 Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907
3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639
4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 773.3997
#确保所有相同的结果:
> (setkey(DT [。(indx)]),setkey(DT [x ==a| y == 25]))
[1] TRUE
>相同的(setkey(DT [。(indx)]),setkey(rbind(DT [J(a)],DT [J(setdiff(unique(x),a),25) b $ b [1] TRUE
请注意,对于SMALL tabbles(小于
15K
行),向量搜索更快(对于真正小的表,大约两倍快)##使用100行
>微基准(Using_Indx = DT [。(indx)],Using_RbindList = rbindlist(list(DT [J(a)],DT [J(setdiff(unique(x),a), Using_rbind = rbind(DT [J(a)],DT [J(setdiff(unique(x),a),25)]),Using_VecSearch = DT [x ==a| y == 25 ],times = 150L)
单位:微秒
expr min lq median uq max
1 Using_Indx 884.819 901.854 917.3715 933.642 9740.046
2 Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637
3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146
4 Using_VecSearch 386.867 401.328 407.5730 420.647 2908.090
约10,000行,此时我们开始看到增益:
## 10,000行
单位:微秒
expr min lq median uq max
1 Using_Indx 891.374 921.784 931.6585 956.737 3780.971
4 Using_VecSearch 796.316 815.965 824.1480 845.151 2531.314
## 15,000行
单位:微秒
expr min lq median uq max
1 Using_Indx 913.963 939.198 954.518 986.609 2900.174
4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470
## 30,000行
单位:微秒
expr min lq median uq max
1 Using_Indx 964.402 995.883 1018.535 1045.908 5999.390
4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976 8868.470
## 100,000行
单位:milliseconds
expr min lq median uq max
1 Using_Indx 1.142318 1.181023 1.198611 1.268417 3.611945
4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510
## 10,000,000行(仅运行30
单位:毫秒
expr min lq median uq max
1 Using_Indx 33.95004 42.24995 48.90363 50.15424 177.0991
2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465
Following the previous question using
data.table
DT = data.table(x=sample(letters,1e7,T),y=sample(1:25,1e7,T),rnorm(1e7)) setkey(DT,x,y)
Can we use binary search to find
DT[x=='a' | y==25]
Remember that
DT[J('a',25)] == DT[x=='a' & y==25]
解决方案Yes:
In order to do a binary serach, we need the appropriate indices.indx <- rbind(DT[y==25, list(y=25), by=x], DT[.("a"), list(x="a"), by=y], use.names=TRUE) indx <- setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))) indx DT[.(indx)]
Benchmarking:
This gives us more than a 10x improvement over vectorized serach.identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25])) # [1] TRUE library(microbenchmark) microbenchmark(UsingIndx = DT[.(indx)], UsingVecSearch = DT[x=="a" | y == 25], times=100 ) Unit: milliseconds expr min lq median uq max 1 UsingIndx 34.27562 41.70119 48.13215 49.29752 231.1669 2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842
For convenience, we can wrap the "creating the index" portion of the code into a nice function, so that we can then call it in a single line. For example:
DT[.(OrIndx("a", 25, DT))]
Where
OrIndx()
is defined as follows:OrIndx <- function(xval, yval, DT) { # TODO: Allow for arbitrary columns and column names if(!is.data.table(DT)) stop("DT is not a data.table") # create all appropriate combinations indx <- rbind(DT[y==yval, list(y=yval), by=x], DT[.(xval), list(x=xval), by=y], use.names=TRUE) # take out any combinations in indx that are not actually present in DT and return return( setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))) ) }
Explanation:
The idea here is that performing an "or" serach requires some form of combination.
In a standard vector search, this combination is of the results of each individual vector serach.data.table offers some great speed improvements by allowing seraches such as
DT[.(c("cdf", "tmb"), c(25, 3))]
Therefore, a natural solution to the question would be to use:
DT[.(c(<all values of x>, "a"), c(25, <all values of y>))]
The only problem is that the recycling would not line up properly.
It would be ideal to have an option likeDT[.( list( c(unique(x), y=25), c(x="a", y=unique(y) ) )]
But as far as I can tell that has not been implemented (yet!)
So instead, we can take appropriate combinations.
The functionOrIndx
above does exactly that. (it s quick & dirty and there are more efficient ways of creating the index)
Update with additional benchmarks
As per @Aruns suggestion, we include
rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]) rbindlist(list( DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)] ))
Tested on 1e6 and 1e7 rows:
## Using 1 Million rows > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_Rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=70L ) Unit: milliseconds expr min lq median uq max 1 Using_Indx 4.865089 5.755615 5.813938 5.957352 6.880743 2 Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670 3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338 4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946 ## Using 10 Milliion rows Unit: milliseconds expr min lq median uq max 1 Using_Indx 33.71108 47.5402 48.7574 50.75285 122.0950 2 Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907 3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639 4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 765.3997 # Making sure all the same results: > identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25])) [1] TRUE > identical(setkey(DT[.(indx)]), setkey(rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]))) [1] TRUE
Note that for SMALL tabbles (less than
15K
rows), vector search is faster (for really small tables, about twice as fast)## Using 100 Rows > microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=150L ) Unit: microseconds expr min lq median uq max 1 Using_Indx 884.819 901.854 917.3715 933.642 9740.046 2 Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637 3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146 4 Using_VecSearch 386.867 401.328 407.5730 420.647 2908.090
This pattern holds until about 10,000 rows, at which point we start to see the gains:
## 10,000 Rows Unit: microseconds expr min lq median uq max 1 Using_Indx 891.374 921.784 931.6585 956.737 3780.971 4 Using_VecSearch 796.316 815.965 824.1480 845.151 2531.314 ## 15,000 Rows Unit: microseconds expr min lq median uq max 1 Using_Indx 913.963 939.198 954.518 986.609 2900.174 4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470 ## 30,000 Rows Unit: microseconds expr min lq median uq max 1 Using_Indx 964.402 995.883 1018.535 1045.908 5999.390 4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976 8868.470 ## 100,000 Rows Unit: milliseconds expr min lq median uq max 1 Using_Indx 1.142318 1.181023 1.198611 1.268417 3.611945 4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510 ## 10,000,000 Rows (only ran 30 reps for this one) Unit: milliseconds expr min lq median uq max 1 Using_Indx 33.95004 42.24995 48.90363 50.15424 177.0991 2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465
这篇关于我们可以在带有OR选择查询的data.table中进行二分查找的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!