Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan


Question

I recently discovered binary search in data.table. If the table is keyed on multiple columns, is it possible to search on the 2nd key column only?

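library(data.table)
# 1e7 rows: x is a random letter, y a random integer in 1..25, V3 (unnamed) is noise
DT = data.table(x=sample(letters,1e7,TRUE),y=sample(1:25,1e7,TRUE),rnorm(1e7))
setkey(DT,x,y)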
#R> DT[J('x')]
#        x  y       V3
#     1: x  1  0.89109
#     2: x  1 -2.01457
#    ---              
#384922: x 25  0.09676
#384923: x 25  0.25168
#R> DT[J('x',3)]
#       x y       V3
#    1: x 3 -0.88165
#    2: x 3  1.51028
#   ---             
#15383: x 3 -1.62218
#15384: x 3 -0.63601

Thanks to @Arun, here are timings for the keyed lookup versus the vector scan:

R> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  0.220   0.068   0.288 
R> system.time(DT[y==25])
   user  system elapsed 
  0.268   0.092   0.359

Recommended answer

Yes, you can pass all unique values of the first key column together with the specific value you want for the second key column.

DT[J(unique(x), 25), nomatch=0]
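
As a small sanity check, here is a sketch on a toy table (the data and the column name v are made up for illustration) that confirms the keyed lookup returns the same rows as filtering on y directly:

```r
library(data.table)

# Toy data (hypothetical): 2 values of x, 3 values of y, 2 rows per combination
DT <- data.table(x = rep(c("a", "b"), each = 6),
                 y = rep(1:3, times = 4),
                 v = 1:12)
setkey(DT, x, y)

by_key  <- DT[J(unique(x), 3L), nomatch = 0]  # binary search on the full key (x, y)
by_scan <- DT[y == 3L]                        # plain logical filter on y

# Both approaches select the same rows, in the same order
stopifnot(identical(by_key$v, by_scan$v))
```

The single value 3L is recycled against unique(x), so the lookup table has one row per distinct x.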

If you need to subset by more than one value in the second key column (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ, which builds the cross join of its arguments:

DT[CJ(unique(x), 25:24), nomatch=0]
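
A minimal sketch of the same idea on a toy table (data and the column name v are assumptions for illustration), checking the CJ lookup against the %in% filter:

```r
library(data.table)

# Toy data (hypothetical): 2 values of x, 3 values of y, 2 rows per combination
DT <- data.table(x = rep(c("a", "b"), each = 6),
                 y = rep(1:3, times = 4),
                 v = 1:12)
setkey(DT, x, y)

# CJ() expands to every (x, y) pair: (a,2), (a,3), (b,2), (b,3) after sorting
by_cj   <- DT[CJ(unique(x), 3:2), nomatch = 0]
by_scan <- DT[y %in% 3:2]

# Same rows, same order, because CJ sorts its output by default
stopifnot(identical(by_cj$v, by_scan$v))
```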

Note that CJ by default sorts the columns and sets the key to all of them, which means the result will be sorted as well. If that is not desirable, use sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
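
To make the difference concrete, this sketch (toy data, with the column name v assumed for illustration) compares the row order of the result with and without sorted=FALSE:

```r
library(data.table)

# Toy data (hypothetical): 2 values of x, 3 values of y, 2 rows per combination
DT <- data.table(x = rep(c("a", "b"), each = 6),
                 y = rep(1:3, times = 4),
                 v = 1:12)
setkey(DT, x, y)

sorted_res   <- DT[CJ(unique(x), 3:2), nomatch = 0]                  # y comes back as 2 then 3
unsorted_res <- DT[CJ(unique(x), 3:2, sorted = FALSE), nomatch = 0]  # y keeps the 3, 2 order given

print(sorted_res$y)
print(unsorted_res$y)
```

With sorted=FALSE the result follows the order in which the values were supplied, which matters if later code relies on that order.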

There is also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function, set2key.

FR#1007 Built-in secondary keys
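
For readers on a more recent data.table: as far as I know, this functionality later shipped as secondary indices, through setindex() and the on= argument, rather than under the name set2key. The sketch below assumes such a version is installed and uses toy data (the column name v is made up):

```r
library(data.table)

# Toy data (hypothetical): 2 values of x, 3 values of y, 2 rows per combination
DT <- data.table(x = rep(c("a", "b"), each = 6),
                 y = rep(1:3, times = 4),
                 v = 1:12)
setkey(DT, x, y)   # primary key stays on (x, y)

setindex(DT, y)    # secondary index on y; does not reorder the rows of DT

# Look up on y alone; on= names the column to join on, independently of the key
res <- DT[.(3L), on = "y"]
print(res)
```

This avoids having to enumerate unique(x) just to reach the second key column.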

There is also merge, which has a method for data.table. It builds the secondary key internally for you, so it should be faster than base merge. See ?merge.data.table.
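
One way to use merge for this, sketched on toy data (the lookup table wanted and the column name v are assumptions for illustration), is to merge against a one-column table holding the y values of interest:

```r
library(data.table)

# Toy data (hypothetical): 2 values of x, 3 values of y, 2 rows per combination
DT <- data.table(x = rep(c("a", "b"), each = 6),
                 y = rep(1:3, times = 4),
                 v = 1:12)
setkey(DT, x, y)

wanted <- data.table(y = 3L)         # the second-key values to keep
res <- merge(DT, wanted, by = "y")   # dispatches to merge.data.table

print(res)
```

The by column comes first in the merged result, so the column order differs from DT.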
