Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
Question
I recently discovered binary search in data.table. If the table is sorted on multiple keys, is it possible to search on the 2nd key only?
library(data.table)
DT = data.table(x=sample(letters,1e7,TRUE),y=sample(1:25,1e7,TRUE),rnorm(1e7))
setkey(DT,x,y)  # sort by x, then y, and mark both columns as the key
#R> DT[J('x')]
# x y V3
# 1: x 1 0.89109
# 2: x 1 -2.01457
# ---
#384922: x 25 0.09676
#384923: x 25 0.25168
#R> DT[J('x',3)]
# x y V3
# 1: x 3 -0.88165
# 2: x 3 1.51028
# ---
#15383: x 3 -1.62218
#15384: x 3 -0.63601
Thanks to @Arun:
R> system.time(DT[J(unique(x), 25)])
user system elapsed
0.220 0.068 0.288
R> system.time(DT[y==25])
user system elapsed
0.268 0.092 0.359
Answer
Yes, you can pass all values of the first key column and subset with the specific value for the second key column:
DT[J(unique(x), 25), nomatch=0]
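As a quick sanity check, the sketch below compares the keyed join with the plain vector scan on a toy table rather than the 1e7-row one from the question. The table `small` and its column `v` are made up for illustration only.

```r
library(data.table)

set.seed(1)
# Toy table (hypothetical data), keyed on x, then y
small <- data.table(x = sample(letters[1:4], 200, TRUE),
                    y = sample(1:5, 200, TRUE),
                    v = rnorm(200))
setkey(small, x, y)

# Binary search: every value of the first key column, one value of the second
by_join <- small[J(unique(x), 5), nomatch = 0]

# Vector scan on the second key column
by_scan <- small[y == 5]

# Both approaches should select the same rows
same_rows <- identical(sort(by_join$v), sort(by_scan$v))
```

Supplying `unique(x)` is what lets the join use both key columns. The cost is one lookup per distinct value of `x`, so the gain over a scan shrinks as the first key column gets more distinct values.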
If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:
DT[CJ(unique(x), 25:24), nomatch=0]
Note that CJ by default sorts the columns and sets the key on all of them, which means the result will be sorted as well. If that's not desirable, use sorted=FALSE:
DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
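To make the effect of sorted=FALSE concrete, here is a minimal sketch that builds the two cross-join tables on their own, without DT. The column names `x` and `y` and the input values are chosen purely for illustration.

```r
library(data.table)

# Default: each input is sorted and the result is keyed on all columns
sorted_grid <- CJ(x = c("b", "a"), y = 25:24)

# sorted = FALSE: combinations keep the order in which the values were given
unsorted_grid <- CJ(x = c("b", "a"), y = 25:24, sorted = FALSE)

key(sorted_grid)    # key on both columns
key(unsorted_grid)  # no key
```

Because the rows returned by a join follow the order of the lookup table, the unsorted grid is the one to use when the result should come back in the order the values were supplied.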
There's also a feature request to add secondary keys to data.table in future. I believe the plan is to add a new function set2key.
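In data.table releases newer than the one this answer was written against, secondary indices are, as far as I know, exposed through setindex() and the on= argument rather than a function named set2key. The sketch below assumes such a version is installed. The toy table `small` is hypothetical, so check the documentation of your installed version before relying on it.

```r
library(data.table)

set.seed(1)
# Toy table (hypothetical data), keyed on x, then y as in the question
small <- data.table(x = sample(letters[1:4], 200, TRUE),
                    y = sample(1:5, 200, TRUE),
                    v = rnorm(200))
setkey(small, x, y)

# Assumption: this data.table version provides setindex() and on=
setindex(small, y)                  # secondary index on y; row order is unchanged
by_index <- small[.(5L), on = "y"]  # look up y alone, without touching x
```

The primary key on (x, y) stays in place, so joins on `x` or on both columns keep working as before.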
There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so it should be faster than base merge. See ?merge.data.table.
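A minimal sketch of how the same subset might be written with merge, assuming the values of interest are put in a one-column lookup table. The names `small` and `wanted` are hypothetical and exist only for this example.

```r
library(data.table)

set.seed(1)
# Toy table (hypothetical data), keyed on x, then y
small <- data.table(x = sample(letters[1:4], 200, TRUE),
                    y = sample(1:5, 200, TRUE),
                    v = rnorm(200))
setkey(small, x, y)

# Lookup table holding the second-key values to keep
wanted <- data.table(y = c(4L, 5L))

# Dispatches to merge.data.table; inner join on y only
merged <- merge(small, wanted, by = "y")
```

The result contains the rows of `small` whose `y` appears in `wanted`, with `y` moved to the first column.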