Find values in a given interval without a vector scan
Problem description
With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:
> DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
> my.data.table.function(DT, min = 3, max = 10)
x
1: 3
2: 5
3: 8
where DT can be a very big table.
Bonus question: is it possible to do the same thing for a set of non-overlapping intervals, such as
> I <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))
> I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
where both I and DT can be very big.
Thanks a lot.
Recommended answer
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested beyond the simple identical check at the end.
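For readers who would rather not depend on an unexported function, here is a minimal base-R sketch of what vecseq is used for in this answer: expanding pairs of (start, length) into one concatenated vector of row numbers. The helper name expand_runs is hypothetical, and the sketch assumes every length is at least 1; it is not claimed to match vecseq in every edge case.

```r
# Hypothetical helper: expand (start, length) pairs into concatenated index runs.
# Assumes all lengths are >= 1.
expand_runs <- function(start, len) {
  unlist(Map(function(s, l) seq.int(s, length.out = l), start, len))
}

starts <- c(4L, 8L)   # first row of each run
lens   <- c(3L, 2L)   # number of rows in each run
ids    <- c(1L, 2L)   # interval id attached to each run

idx    <- expand_runs(starts, lens)   # rows 4, 5, 6, then 8, 9
run.id <- rep(ids, times = lens)      # one id per expanded row
```

Indexing a sorted table with idx then pulls out whole contiguous blocks of rows, which is what lets the approach avoid comparing every value against every interval.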
With that out of the way, we need a bigger example to show the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the data table, is a not-too-random subset of 1:n. IT, the interval table, holds ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
# Record each row's position in the sorted table
DT[, row := .I]
setkey(IT, min)
# For each interval i, the first DT row whose x is at or above its min
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
# For each interval i, the last DT row whose x is at or below its max
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
# Expand each (min, len) pair into a run of row numbers, then tag each row with its interval id
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
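To see what the two rolling joins contribute, here is a small sketch on toy data (the values are made up for illustration and are not part of the benchmark). With the interval table keyed on min, roll = Inf matches each x to the last interval whose min is at or below it, and roll = -Inf matches each x to the next interval whose min is at or above it.

```r
library(data.table)

# Toy interval table, keyed on the lower endpoint
IT <- data.table(i = 1:2, min = c(3, 20), max = c(10, 40))
setkey(IT, min)

# Toy lookup values
DT <- data.table(x = c(1, 5, 8, 25))

# roll = Inf: each x picks up the last interval whose min is <= x (NA if none)
fwd <- IT[DT, roll = Inf]

# roll = -Inf: each x picks up the next interval whose min is >= x (NA if none)
bwd <- IT[DT, roll = -Inf]
```

A single roll only looks at one endpoint, so a value such as 15 would still roll forward to interval 1 even though it lies beyond that interval's max. That is why the approach above needs one join against min and a second one against max.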
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
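As a possible alternative that avoids the unexported function altogether, data.table versions from 1.9.8 onward support non-equi joins through the on argument. The sketch below applies one to the small example from the question. It is not the method benchmarked above and has not been timed here, so treat it as a starting point rather than a drop-in replacement.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
I  <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))

# Non-equi join: keep rows of DT with min <= x <= max for each interval.
# x.x refers to DT's own x values; the join columns would otherwise show the bounds from I.
res <- DT[I, on = .(x >= min, x <= max), .(i, x = x.x), nomatch = 0L]
```

On this input, res holds the five rows shown in the question's expected output for the bonus case.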