Find values in a given interval without a vector scan

Question

With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:

> DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
> my.data.table.function(DT, min = 3, max = 10)
   x
1: 3
2: 5
3: 8

Here DT can be a very big table.

Bonus question: is it possible to do the same thing for a set of non-overlapping intervals, such as

> I <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))
> I
   i min max
1: 1   3  10
2: 2  20  40
> my.data.table.function2(DT, I)
   i  x
1: 1  3
2: 1  5
3: 1  8
4: 2 21
5: 2 34

Here both I and DT can be very big. Thanks a lot.

Recommended answer

First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior could change without warning in future updates to the package. Also, this is untested apart from the simple identical() check at the end.
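For readers who would rather not depend on an unexported helper: as it is used later in this answer, vecseq turns a vector of start positions and a vector of run lengths into one vector of row indices. A minimal base R sketch of that expansion follows. The name expand_runs is hypothetical and is not part of data.table, and it is only assumed to match vecseq for the plain start/length case used here.

```r
# Hypothetical stand-in (not part of data.table): expand (start, length)
# pairs into a single integer vector of consecutive row indices.
expand_runs <- function(start, len) {
  # sequence(len) yields 1..len[k] for each run k; subtract 1 to get offsets
  offsets <- sequence(len) - 1L
  # repeat each start once per element of its run, then add the offsets
  rep.int(start, len) + offsets
}

# Runs of length 3 starting at row 4 and length 2 starting at row 8
expand_runs(c(4L, 8L), c(3L, 2L))
```

A run of length zero contributes no indices, which is how an interval that contains no values would drop out.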

With that out of the way, we need a bigger example to show the difference from the vector scan approach:

require(data.table)

n <- 1e5L
f <- 10L
ni <- n / f

set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni, 
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))

DT, the data table, holds n values that stay close to 1:n (each value is its position plus a random offset between -f and f). IT, the interval table, holds ni = n / 10 non-overlapping intervals within 1:n. Doing the repeated vector scan on all ni intervals takes a while:

system.time({
  ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
 ##  user  system elapsed 
 ## 84.15    4.48   88.78

One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one fell swoop:

system.time({
  # Save time if DT is already keyed correctly
  if(!identical(key(DT), "x")) setkey(DT, x)

  DT[, row := .I]

  setkey(IT, min)

  target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(IT, "sorted", "max")

  target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]


  rm(target.low, target.high)

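  # Expand each (first row, length) pair into row indices; vecseq is unexported
  idx <- data.table:::vecseq(target$min, target$len, NULL)
  # Label every selected row with the id of the interval it came from
  ans.roll <- DT[idx][, i := rep(target$i, times = target$len)]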
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})
 ## user  system elapsed 
 ## 0.12    0.00    0.12
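To see what the two rolling joins return, here is a minimal sketch on the toy data from the question. It assumes a reasonably recent data.table, and it re-keys IT with setkey(IT, max) instead of using the setattr shortcut, which is slower but avoids asserting the sort order by hand. The names first_row, last_row, and bounds are illustrative only.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
IT <- data.table(i = 1:2, min = c(3, 20), max = c(10, 40))

setkey(DT, x)
DT[, row := .I]   # remember each value's position in the sorted table

# roll = Inf: each x is matched to the last interval whose min is <= x,
# so the first matched row per interval is the first x at or above min
setkey(IT, min)
first_row <- IT[DT, roll = Inf, nomatch = 0][, .(first = row[1L]), keyby = i]

# roll = -Inf: each x is matched to the next interval whose max is >= x,
# so the last matched row per interval is the last x at or below max
setkey(IT, max)
last_row <- IT[DT, roll = -Inf, nomatch = 0][, .(last = row[.N]), keyby = i]

# One row per interval: the range of DT rows that falls inside it
bounds <- first_row[last_row, nomatch = 0]
bounds
```

For interval 1 the range is rows 4 to 6 (the values 3, 5, 8) and for interval 2 it is rows 8 to 9 (the values 21, 34), which agrees with the output requested in the question.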

Putting both results in the same row order lets us verify the result:

setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
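As an alternative that avoids both the unexported vecseq and the setattr shortcut, newer data.table releases support non-equi joins through the on argument. This is a different technique from the rolling joins above, shown here only as a sketch on the toy data from the question. It assumes a data.table version with non-equi join support, and it has not been timed against the approach above.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
IT <- data.table(i = 1:2, min = c(3, 20), max = c(10, 40))

# For each interval in IT, pick the DT rows with min <= x <= max.
# x.x refers to DT's own x values; without the prefix, the join columns
# in the result would carry the interval bounds instead.
ans.nonequi <- DT[IT, on = .(x >= min, x <= max),
                  .(i, x = x.x), nomatch = 0L]
ans.nonequi
```

The result has one row per (interval, value) pair, in the same i, x layout as ans.roll.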
