Efficient subsetting of data.table with greater-than, less-than using indices


Question

I'm trying to use data.table in R for efficient subsetting with greater-than and less-than comparisons, like this:

library(data.table)

x = runif(10000, min = 1, max = 2)

rowname = seq(10000)
min.x = x - 0.0001
max.x = x + 0.0001

table = data.table(rowname, min.x, max.x)
system.time(x.candidates <- lapply(x, function(x) {table[x > min.x & x < max.x, rowname]}))

#    ->    user  system elapsed 
#       4.87    0.00    4.90 

table2 = data.table(rowname, min.x, max.x)
setindex(table2, min.x)
setindex(table2, max.x)
system.time(x.candidates2 <- lapply(x, function(x) {table2[x > min.x & x < max.x, rowname]}))

#    -> user  system elapsed 
#       4.90    0.00    4.92 

table3 = data.frame(rowname, min.x, max.x)
system.time(x.candidates3 <- lapply(x, function(x) {table3[x > table3$min.x & x < table3$max.x, "rowname"]}))

#    ->    user  system elapsed 
#       1.77    0.00    1.78

However, I see no speedup when setting indices, and data.frame is even faster. Is it possible to write this code more efficiently in data.table, or in R in general?

Best solution

As @eddi pointed out, the correct way is a non-equi join using by = .EACHI:

table4 = data.table(rowname, min.x, max.x)
system.time(x.candidates4 <- table4[data.table(x), on = .(min.x < x, max.x > x), list(rowname = list(rowname)), by = .EACHI])

#   user  system elapsed 
#   0.02    0.00    0.01 
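
As a sanity check, the following sketch compares the loop and the join on a smaller sample and confirms that they return the same matches for every value of x. The seed, the sample size, the wider tolerance and the names dt, loop_res, join_res and same are assumptions made for illustration; they are not part of the original post.

```r
library(data.table)

set.seed(1)   # assumed seed, only so the sketch is reproducible
n <- 500      # assumed size, smaller than the 10000 used in the question
x <- runif(n, min = 1, max = 2)
dt <- data.table(rowname = seq_len(n), min.x = x - 0.001, max.x = x + 0.001)

# Loop version from the question: one [.data.table call per value of x
loop_res <- lapply(x, function(v) dt[v > min.x & v < max.x, rowname])

# Join version: a single non-equi join; by = .EACHI yields one group per
# row of data.table(x), and the matches are collected in a list column
join_res <- dt[data.table(x), on = .(min.x < x, max.x > x),
               .(rowname = list(rowname)), by = .EACHI]

# The order of matches within a group may differ, so compare sorted vectors
same <- all(mapply(function(a, b) identical(sort(a), sort(b)),
                   loop_res, join_res$rowname))
print(same)
```

Sorting before the comparison matters because the join does not promise to return the matching rows in their original row order.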


Recommended answer

You're doing it wrong. Calling [.data.table in a loop, which is what your lapply does, is going to be slow because that function has a lot of overhead per call, and that overhead is not worth it for the tiny operation you perform each time. The correct way is to do a single non-equi join:

table[data.table(x), on = .(min.x < x, max.x > x), rowname, by = .EACHI]
#          min.x    max.x rowname
#    1: 1.084668 1.084668       1
#    2: 1.293461 1.293461    7734
#    3: 1.293461 1.293461     739
#    4: 1.293461 1.293461       2
#    5: 1.293461 1.293461    3757
#   ---                          
#30216: 1.324366 1.324366    9999
#30217: 1.324366 1.324366    9635
#30218: 1.869469 1.869469    8740
#30219: 1.869469 1.869469    3302
#30220: 1.869469 1.869469   10000

The above is instantaneous. The current column naming is a bit unfortunate (there is an FR to fix that): the first two columns, min.x and max.x, actually hold the value of x that each row was matched against. Imagining both of them named x should make the output clearer.
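
To make the column naming concrete, here is a minimal sketch on a hypothetical three-row table. The names intervals, pts, id, lo, hi and x_dup are invented for this example. It runs the same kind of non-equi join and then renames the two join columns by position, so that the result shows directly which x each match belongs to.

```r
library(data.table)

# Hypothetical intervals (lo, hi) and two query points
intervals <- data.table(id = 1:3,
                        lo = c(0.0, 0.5, 2.0),
                        hi = c(1.0, 1.5, 3.0))
pts <- data.table(x = c(0.75, 2.5))

# For each x, return the ids of all intervals with lo < x < hi
res <- intervals[pts, on = .(lo < x, hi > x), .(id = id), by = .EACHI]

# The first two columns of the result carry the value of x from pts,
# not the interval bounds, so rename them by position
setnames(res, 1:2, c("x", "x_dup"))
print(res)
```

Here 0.75 falls inside intervals 1 and 2 and 2.5 falls inside interval 3, so the result has three rows. Renaming by position instead of by name keeps the sketch independent of how the join labels those columns.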
