Efficient subsetting of data.table with greater-than, less-than using indices
Question
I'm trying to use data.table in R for efficient subsetting using greater-than and less-than, like this:
library(data.table)
x = runif(10000, min = 1, max = 2)
rowname = seq(10000)
min.x = x - 0.0001
max.x = x + 0.0001
table = data.table(rowname, min.x, max.x)
system.time(x.candidates <- lapply(x, function(x) {table[x > min.x & x < max.x, rowname]}))
# -> user system elapsed
# 4.87 0.00 4.90
table2 = data.table(rowname, min.x, max.x)
setindex(table2, min.x)
setindex(table2, max.x)
system.time(x.candidates2 <- lapply(x, function(x) {table2[x > min.x & x < max.x, rowname]}))
# -> user system elapsed
# 4.90 0.00 4.92
table3 = data.frame(rowname, min.x, max.x)
system.time(x.candidates3 <- lapply(x, function(x) {table3[x > table3$min.x & x < table3$max.x, "rowname"]}))
# -> user system elapsed
# 1.77 0.00 1.78
However, I see no speedup when setting indices, and data.frame is even faster. Is it even possible to write this code more efficiently in data.table, or in R in general?
Best solution
As @eddi pointed out, this is the correct way, using .EACHI:
table4 = data.table(rowname, min.x, max.x)
system.time(x.candidates4 <- table4[data.table(x), on = .(min.x < x, max.x > x), list(rowname = list(rowname)), by = .EACHI])
# user system elapsed
# 0.02 0.00 0.01
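To check that the join returns the same candidates as the original loop, a small comparison along these lines may help. It is only a sketch: the seed, the reduced size n and the names dt, by_loop, joined and same are assumptions for illustration and do not appear in the original post.

```r
library(data.table)

set.seed(1)   # assumption: fixed seed so the run is repeatable
n <- 1000     # assumption: smaller than the original 10000 to keep the loop quick
x <- runif(n, min = 1, max = 2)
dt <- data.table(rowname = seq_len(n), min.x = x - 0.0001, max.x = x + 0.0001)

# Loop version, as in the question: one [.data.table call per value of x
by_loop <- lapply(x, function(v) dt[v > min.x & v < max.x, rowname])

# Non-equi join version: one call, with one result row per row of data.table(x)
joined <- dt[data.table(x), on = .(min.x < x, max.x > x),
             list(rowname = list(rowname)), by = .EACHI]

# Compare the matches element by element, ignoring order within each set
same <- all(mapply(setequal, by_loop, joined$rowname))
same
```

Row order within each match set is not guaranteed to agree between the two approaches, which is why the comparison uses setequal rather than identical.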
Recommended answer
You're doing it wrong. Calling [.data.table in a loop, which is what your lapply does, is going to be slow because that function has a lot of overhead, and that overhead is not worth it for the tiny operation that you do. The correct way is to do a non-equi join:
table[data.table(x), on = .(min.x < x, max.x > x), rowname, by = .EACHI]
# min.x max.x rowname
# 1: 1.084668 1.084668 1
# 2: 1.293461 1.293461 7734
# 3: 1.293461 1.293461 739
# 4: 1.293461 1.293461 2
# 5: 1.293461 1.293461 3757
# ---
#30216: 1.324366 1.324366 9999
#30217: 1.324366 1.324366 9635
#30218: 1.869469 1.869469 8740
#30219: 1.869469 1.869469 3302
#30220: 1.869469 1.869469 10000
The above is instantaneous. The current column naming is a bit unfortunate (there is an FR to fix that); imagining the first two columns being named x should add more clarity.
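Until that naming changes, one possible workaround is to return the lookup value explicitly through the i. prefix and then drop the two join columns. This is a sketch on a small made-up sample; the seed, the sample size and the names dt and res are assumptions, not part of the original answer.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed for a repeatable sample
x <- runif(100, min = 1, max = 2)
dt <- data.table(rowname = seq_along(x), min.x = x - 0.0001, max.x = x + 0.0001)

# i.x refers to column x of the table passed as i, i.e. the value being looked up
res <- dt[data.table(x), on = .(min.x < x, max.x > x),
          .(x = i.x, rowname), by = .EACHI]

# The join columns min.x and max.x both just repeat the value of x, so drop them
res[, c("min.x", "max.x") := NULL]
res
```

The result then has one row per (x, matching rowname) pair, with the lookup value in a column that is actually called x.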