2个逻辑向量的元素之间的快速最小距离(间隔)(取2) [英] Fast minimum distance (interval) between elements of 2 logical vectors (take 2)

查看:65
本文介绍了2个逻辑向量的元素之间的快速最小距离(间隔)(取2)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这里问了一个相关问题,但我意识到我计算这个复杂的度量花费了太多时间(目标是与随机化测试一起使用,所以速度是一个问题).所以我决定放弃权重,只使用两个度量之间的最小距离.所以这里我有 2 个向量(在一个用于演示目的的数据框中,但实际上它们是两个向量.

I asked a related question here but realized I was burning too much time calculating this complex measure (And the goal is to use with a randomization test so speed is an issue). So I've decided to throw out the weighting and just use the minimum distance between two measures. So here I have 2 vectors (in a data frame for demo purposes but in reality they are two vectors.

       x     y
1  FALSE  TRUE
2  FALSE FALSE
3   TRUE FALSE
4  FALSE FALSE
5  FALSE  TRUE
6  FALSE FALSE
7  FALSE FALSE
8   TRUE FALSE
9  FALSE  TRUE
10  TRUE  TRUE
11 FALSE FALSE
12 FALSE FALSE
13 FALSE FALSE
14 FALSE  TRUE
15  TRUE FALSE
16 FALSE FALSE
17  TRUE  TRUE
18 FALSE  TRUE
19 FALSE FALSE
20 FALSE  TRUE
21 FALSE FALSE
22 FALSE FALSE
23 FALSE FALSE
24 FALSE FALSE
25  TRUE FALSE

这里我有一些代码可以找到最小距离,但我需要更快的速度(去除不必要的调用和更好的矢量化).也许我不能在基础 R 中走得更快.

Here I have some code worked out to find the minimal distance but I need more speed (removal of unnecessary calls and better vectorization). Maybe I can't go any faster in base R.

## MWE EXAMPLE: THE DATA
x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE

## Code to Find Distances
xw <- which(x)
yw <- which(y)

min_dist <- function(xw, yw) {
    unlist(lapply(xw, function(x) {
        min(abs(x - yw))
    }))
}

min_dist(xw, yw)

有没有办法提高基础 R 的性能?使用 dplyr 还是 data.table?

Is there any way to improve performance in base R? Using dplyr or data.table?

我的向量要长得多(10,000 + 个元素).

My vectors are much longer (10,000 + elements).

编辑每个 flodel 的工作台.我在 MWE 中预料到了一个问题,但我也不知道如何解决它.如果任何 x 位置小于最小 y 位置,就会出现问题.

Edit per flodel's benching. flodel there's an issue I had anticipated in my MWE and I'm not sure how to fix it either. The problem arises if any x position is less than the minimum y position.

x <- y <- rep(FALSE, 25)
x[c(3, 8, 9, 15, 17, 25)] <- TRUE
y[c(5, 9, 10, 13, 15, 17, 19)] <- TRUE


xw <- which(x)
yw <- which(y)

flodel <- function(xw, yw) {
   i <- findInterval(xw, yw)
   pmin(xw - yw[i], yw[i+1L] - xw, na.rm = TRUE)
}

flodel(xw, yw)

## [1] -2 -1 -6 -2 -2 20
## Warning message:
## In xw - yw[i] :
##   longer object length is not a multiple of shorter object length

推荐答案

flodel <- function(x, y) {
  xw <- which(x)
  yw <- which(y)
  i <- findInterval(xw, yw, all.inside = TRUE)
  pmin(abs(xw - yw[i]), abs(xw - yw[i+1L]), na.rm = TRUE)
}

GG1 <- function(x, y) {
  require(zoo)
  yy <- ifelse(y, TRUE, NA) * seq_along(y)
  fwd <- na.locf(yy, fromLast = FALSE)[x]
  bck <- na.locf(yy, fromLast = TRUE)[x]
  wx <- which(x)
  pmin(wx - fwd, bck - wx, na.rm = TRUE)
}

GG2 <- function(x, y) {
  require(data.table)
  dtx <- data.table(x = which(x))
  dty <- data.table(y = which(y), key = "y")
  dty[dtx, abs(x - y), roll = "nearest"] 
}

示例数据:

x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE

X <- rep(x, 100)
Y <- rep(y, 100)

单元测试:

identical(flodel(X, Y), GG1(X, Y))
# [1] TRUE

基准:

library(microbenchmark)
microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
# Unit: microseconds
#          expr       min         lq     median        uq        max neval
#  flodel(X, Y)   115.546   131.8085   168.2705   189.069   1980.316   100
#     GG1(X, Y)  2568.045  2828.4155  3009.2920  3376.742  63870.137   100
#     GG2(X, Y) 22210.708 22977.7340 24695.7225 28249.410 172074.881   100

[Matt Dowle 编辑] 24695 微秒 = 0.024 秒.对具有微小数据的微基准进行的推论很少能保持有意义的数据大小.

24695 microseconds = 0.024 seconds. Inferences made on microbenchmarks with tiny data rarely hold on meaningful data sizes.

[由 flodel 编辑] 我的向量长度为​​ 2500,考虑到 Tyler 的声明 (10k),这是相当有意义的,但是很好,让我们尝试使用长度为 2.5e7 的向量.鉴于情况,我希望你能原谅我使用 system.time :

My vectors had length 2500 which was rather meaningful given Tyler's statement (10k), but fine, let's try with vectors of length 2.5e7. I hope you will forgive me for using system.time given the circumstances:

X <- rep(x, 1e6)
Y <- rep(y, 1e6)
system.time(flodel(X,Y))
#    user  system elapsed 
#   0.694   0.205   0.899 
system.time(GG1(X,Y))
#    user  system elapsed 
#  31.250  16.496 112.967 
system.time(GG2(X,Y))
# Error in `[.data.table`(dty, dtx, abs(x - y), roll = "nearest") : 
#   negative length vectors are not allowed

<小时>

[来自 Arun 的编辑] - 使用 1.8.11 的 2.5e7 基准测试:
[来自 Arun 的编辑 2] - 在 Matt 最近更快二分搜索/合并

require(data.table)
arun <- function(x, y) {
    dtx <- data.table(x=which(x))
    setattr(dtx, 'sorted', 'x')
    dty <- data.table(y=which(y))
    setattr(dty, 'sorted', 'y')
    dty[, y1 := y]
    dty[dtx, roll="nearest"][, abs(y-y1)]
}

# minimum of three consecutive runs
system.time(ans1 <- arun(X,Y))
#   user  system elapsed 
#  1.036   0.138   1.192 

# minimum of three consecutive runs
system.time(ans2 <- flodel(X,Y))
#   user  system elapsed 
#  0.983   0.197   1.221 

identical(ans1, ans2) # [1] TRUE

这篇关于2个逻辑向量的元素之间的快速最小距离(间隔)(取2)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆