Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
Question
I recently discovered binary search in data.table. If the table is sorted on multiple keys, is it possible to search on the 2nd key only?
library(data.table)
DT = data.table(x=sample(letters,1e7,TRUE), y=sample(1:25,1e7,TRUE), rnorm(1e7))
setkey(DT,x,y)
#R> DT[J('x')]
# x y V3
# 1: x 1 0.89109
# 2: x 1 -2.01457
# ---
#384922: x 25 0.09676
#384923: x 25 0.25168
#R> DT[J('x',3)]
# x y V3
# 1: x 3 -0.88165
# 2: x 3 1.51028
# ---
#15383: x 3 -1.62218
#15384: x 3 -0.63601
Thanks to @Arun, here is the keyed join timed against the vector scan:
R> system.time(DT[J(unique(x), 25)])
user system elapsed
0.220 0.068 0.288
R> system.time(DT[y==25])
user system elapsed
0.268 0.092 0.359
Recommended answer
Yes, you can pass all values of the first key column and subset with the specific value for the second key:
DT[J(unique(x), 25), nomatch=0]
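To check that the join above returns the same rows as the vector scan, both can be compared on a small table. This is only a sketch: the table size, the seed and the column name v are assumptions made for illustration and are not part of the original question.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
# Small stand-in for the question's table; `v` is an illustrative column name
DT <- data.table(x = sample(letters, 1e4, TRUE),
                 y = sample(1:25, 1e4, TRUE),
                 v = rnorm(1e4))
setkey(DT, x, y)

# Binary search on the key (x, y): every value of x paired with y == 25
by_join <- DT[J(unique(x), 25L), nomatch = 0]

# Vector scan over y, for comparison
by_scan <- DT[y == 25L]

# Both approaches should select the same rows
same_rows <- nrow(by_join) == nrow(by_scan) &&
  identical(sort(by_join$v), sort(by_scan$v))
same_rows
```

nomatch=0 drops any (x, 25) combination that does not occur in the table; with the default nomatch=NA those combinations would come back as rows filled with NA.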
If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:
DT[CJ(unique(x), 25:24), nomatch=0]
Note that CJ by default sorts the columns and sets the key to all the columns, which means the result will be sorted as well. If that's not desirable, use sorted=FALSE:
DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
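The effect of sorted is easiest to see on CJ alone. A minimal sketch, with tiny made-up inputs chosen only so the ordering is visible:

```r
library(data.table)

# Default: each input is sorted, and the result is keyed on all columns
cj_sorted <- CJ(x = c("b", "a"), y = 25:24)

# sorted = FALSE: combinations keep the order in which the inputs were given,
# and no key is set
cj_unsorted <- CJ(x = c("b", "a"), y = 25:24, sorted = FALSE)

print(cj_sorted)
print(cj_unsorted)
key(cj_sorted)    # key columns of the sorted result
key(cj_unsorted)  # NULL, since no key was set
```

Because a join returns rows in the order of the lookup table, the order of the CJ result carries over to the subset of DT.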
There's also a feature request to add secondary keys to data.table in future. I believe the plan is to add a new function set2key.
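For readers on a more recent data.table: as far as I know, the secondary-key idea is now available as secondary indices through setindex() and the on= argument, rather than under the name set2key. A minimal sketch, assuming a data.table version that provides both; the small table and the column name v are made up for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
DT <- data.table(x = sample(letters, 1e4, TRUE),
                 y = sample(1:25, 1e4, TRUE),
                 v = rnorm(1e4))
setkey(DT, x, y)

# Add a secondary index on y; the primary key (x, y) is left untouched
setindex(DT, y)

# Join on y alone, without listing the values of x
res <- DT[.(25L), on = "y", nomatch = 0]

key(DT)      # still x, y
indices(DT)  # lists the secondary index on y
nrow(res) == nrow(DT[y == 25L])
```

This avoids building the cross product of unique(x) with the wanted y values just to reach the second key column.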
There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so it should be faster than base merge. See ?merge.data.table.
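As a rough illustration of the merge route, the wanted values of y can be put in a small lookup table and merged on that column. This is a sketch under assumptions: the lookup table name wanted, the table size and the column name v are all made up here.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
DT <- data.table(x = sample(letters, 1e4, TRUE),
                 y = sample(1:25, 1e4, TRUE),
                 v = rnorm(1e4))
setkey(DT, x, y)

# Hypothetical lookup table holding the y values to keep
wanted <- data.table(y = 24:25)

# merge() dispatches to the data.table method; only rows with a matching y remain
res <- merge(DT, wanted, by = "y")

nrow(res) == nrow(DT[y %in% 24:25])
```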