Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan

Question

I recently discovered binary search in data.table. If the table is sorted on multiple keys, is it possible to search on the 2nd key only?

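library(data.table)
# 1e7 rows: x is a random letter, y a random integer in 1:25, V3 is noise
DT = data.table(x=sample(letters,1e7,TRUE),y=sample(1:25,1e7,TRUE),rnorm(1e7))
setkey(DT,x,y)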
#R> DT[J('x')]
#        x  y       V3
#     1: x  1  0.89109
#     2: x  1 -2.01457
#    ---              
#384922: x 25  0.09676
#384923: x 25  0.25168
#R> DT[J('x',3)]
#       x y       V3
#    1: x 3 -0.88165
#    2: x 3  1.51028
#   ---             
#15383: x 3 -1.62218
#15384: x 3 -0.63601

Thanks to @Arun, timings for the keyed lookup versus a vector scan:

R> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  0.220   0.068   0.288 
R> system.time(DT[y==25])
   user  system elapsed 
  0.268   0.092   0.359


Recommended answer

Yes, you can pass all unique values of the first key column and subset with the specific value for the second key.

DT[J(unique(x), 25), nomatch=0]
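
As a quick sanity check, the keyed lookup can be compared against the plain vector scan on a smaller table. This is only a sketch: the table size, the seed, and the column name v are assumptions made for illustration, not part of the original question.

```r
library(data.table)

set.seed(1)
# Small stand-in for the 1e7-row table in the question (size is an assumption)
DT = data.table(x = sample(letters, 1e4, TRUE),
                y = sample(1:25, 1e4, TRUE),
                v = rnorm(1e4))
setkey(DT, x, y)

# Keyed join: every distinct first-key value paired with y == 25
by_key = DT[J(unique(x), 25L), nomatch = 0]
# Vector scan for comparison
by_scan = DT[y == 25]

# Both approaches should select the same rows
same_rows = isTRUE(all.equal(sort(by_key$v), sort(by_scan$v)))
print(same_rows)
```

nomatch=0 drops any (x, 25) combination that does not occur in DT, so the result contains matched rows only.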

If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:

DT[CJ(unique(x), 25:24), nomatch=0]
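
To see what CJ contributes here, it can help to build the lookup table separately and inspect it before joining. A minimal sketch on toy data follows; the names lookup and res and the sample sizes are assumptions for illustration.

```r
library(data.table)

set.seed(1)
# Toy data: three letters for x, three values for y
DT = data.table(x = sample(letters[1:3], 30, TRUE),
                y = sample(23:25, 30, TRUE),
                v = rnorm(30))
setkey(DT, x, y)

# CJ (cross join) pairs each distinct x with each requested y value
lookup = CJ(unique(DT$x), 25:24)
print(lookup)

# Joining on the two-column key then uses binary search for each pair
res = DT[lookup, nomatch = 0]

# The row count should agree with the vector-scan equivalent
print(nrow(res) == nrow(DT[y %in% 25:24]))
```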

Note that CJ by default sorts the columns and sets the key to all the columns, which means the result will be sorted as well. If that's not desirable, you should use sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
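
The effect of sorted=FALSE on row order is easiest to see on a tiny table. The four-row table below is made up purely for illustration.

```r
library(data.table)

# Hypothetical four-row table, keyed the same way as in the question
DT = data.table(x = c("a", "a", "b", "b"),
                y = c(24L, 25L, 24L, 25L),
                v = 1:4)
setkey(DT, x, y)

# Default: CJ sorts its inputs, so 24 is looked up before 25 within each x
y_sorted = DT[CJ(unique(x), 25:24), nomatch = 0]$y

# sorted = FALSE keeps 25:24 in the order given, so 25 comes first within each x
y_unsorted = DT[CJ(unique(x), 25:24, sorted = FALSE), nomatch = 0]$y

print(y_sorted)
print(y_unsorted)
```

The result of a join follows the row order of the lookup table, which is why the ordering of CJ's output carries through.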

There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function, set2key.

FR#1007 Build in secondary keys
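
This answer predates the feature, but to my knowledge later data.table releases provide secondary indices through setindex() together with the on= argument, rather than under the set2key name. A hedged sketch, assuming a data.table version recent enough to have both (the toy table is made up for illustration):

```r
library(data.table)

# Hypothetical four-row table, keyed the same way as in the question
DT = data.table(x = c("a", "a", "b", "b"),
                y = c(24L, 25L, 24L, 25L),
                v = 1:4)
setkey(DT, x, y)

# Add a secondary index on y; the primary key (x, y) stays in place
setindex(DT, y)

# Join on y alone via on=, without enumerating unique(x)
res = DT[.(25L), on = "y", nomatch = 0]
print(res)
print(key(DT))
```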

There is also merge, which has a method for data.table. It builds the secondary key internally for you, so it should be faster than base merge. See ?merge.data.table.
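
A minimal sketch of the merge route, assuming the wanted second-key values are held in a small one-column table (the name wanted and the toy data are hypothetical):

```r
library(data.table)

# Hypothetical four-row table, keyed the same way as in the question
DT = data.table(x = c("a", "a", "b", "b"),
                y = c(24L, 25L, 24L, 25L),
                v = 1:4)
setkey(DT, x, y)

# One-column table holding the second-key values to keep
wanted = data.table(y = 25L)

# merge dispatches to merge.data.table and joins on the column named in by=
res = merge(DT, wanted, by = "y")
print(res)
```

Here the join column does not have to be the leading key column of DT, which is the convenience the answer points to.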
