2个逻辑向量的元素之间的快速最小距离(间隔)(取2) [英] Fast minimum distance (interval) between elements of 2 logical vectors (take 2)
问题描述
我在这里问了一个相关问题,但我意识到我计算这个复杂的度量花费了太多时间(目标是与随机化测试一起使用,所以速度是一个问题).所以我决定放弃权重,只使用两个度量之间的最小距离.所以这里我有 2 个向量(在一个用于演示目的的数据框中,但实际上它们是两个向量.
I asked a related question here but realized I was burning too much time calculating this complex measure (And the goal is to use with a randomization test so speed is an issue). So I've decided to throw out the weighting and just use the minimum distance between two measures. So here I have 2 vectors (in a data frame for demo purposes but in reality they are two vectors.
x y
1 FALSE TRUE
2 FALSE FALSE
3 TRUE FALSE
4 FALSE FALSE
5 FALSE TRUE
6 FALSE FALSE
7 FALSE FALSE
8 TRUE FALSE
9 FALSE TRUE
10 TRUE TRUE
11 FALSE FALSE
12 FALSE FALSE
13 FALSE FALSE
14 FALSE TRUE
15 TRUE FALSE
16 FALSE FALSE
17 TRUE TRUE
18 FALSE TRUE
19 FALSE FALSE
20 FALSE TRUE
21 FALSE FALSE
22 FALSE FALSE
23 FALSE FALSE
24 FALSE FALSE
25 TRUE FALSE
这里我有一些代码可以找到最小距离,但我需要更快的速度(去除不必要的调用和更好的矢量化).也许我不能在基础 R 中走得更快.
Here I have some code worked out to find the minimal distance but I need more speed (removal of unnecessary calls and better vectorization). Maybe I can't go any faster in base R.
## MWE EXAMPLE: THE DATA
x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE
## Code to Find Distances
xw <- which(x)
yw <- which(y)
min_dist <- function(xw, yw) {
unlist(lapply(xw, function(x) {
min(abs(x - yw))
}))
}
min_dist(xw, yw)
有没有办法提高基础 R 的性能?使用 dplyr
还是 data.table
?
Is there any way to improve performance in base R? Using dplyr
or data.table
?
我的向量要长得多(10,000 + 个元素).
My vectors are much longer (10,000 + elements).
编辑每个 flodel 的工作台.我在 MWE 中预料到了一个问题,但我也不知道如何解决它.如果任何 x 位置小于最小 y 位置,就会出现问题.
Edit per flodel's benching. flodel there's an issue I had anticipated in my MWE and I'm not sure how to fix it either. The problem arises if any x position is less than the minimum y position.
x <- y <- rep(FALSE, 25)
x[c(3, 8, 9, 15, 17, 25)] <- TRUE
y[c(5, 9, 10, 13, 15, 17, 19)] <- TRUE
xw <- which(x)
yw <- which(y)
flodel <- function(xw, yw) {
i <- findInterval(xw, yw)
pmin(xw - yw[i], yw[i+1L] - xw, na.rm = TRUE)
}
flodel(xw, yw)
## [1] -2 -1 -6 -2 -2 20
## Warning message:
## In xw - yw[i] :
## longer object length is not a multiple of shorter object length
推荐答案
flodel <- function(x, y) {
xw <- which(x)
yw <- which(y)
i <- findInterval(xw, yw, all.inside = TRUE)
pmin(abs(xw - yw[i]), abs(xw - yw[i+1L]), na.rm = TRUE)
}
GG1 <- function(x, y) {
require(zoo)
yy <- ifelse(y, TRUE, NA) * seq_along(y)
fwd <- na.locf(yy, fromLast = FALSE)[x]
bck <- na.locf(yy, fromLast = TRUE)[x]
wx <- which(x)
pmin(wx - fwd, bck - wx, na.rm = TRUE)
}
GG2 <- function(x, y) {
require(data.table)
dtx <- data.table(x = which(x))
dty <- data.table(y = which(y), key = "y")
dty[dtx, abs(x - y), roll = "nearest"]
}
示例数据:
x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE
X <- rep(x, 100)
Y <- rep(y, 100)
单元测试:
identical(flodel(X, Y), GG1(X, Y))
# [1] TRUE
基准:
library(microbenchmark)
microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
# Unit: microseconds
# expr min lq median uq max neval
# flodel(X, Y) 115.546 131.8085 168.2705 189.069 1980.316 100
# GG1(X, Y) 2568.045 2828.4155 3009.2920 3376.742 63870.137 100
# GG2(X, Y) 22210.708 22977.7340 24695.7225 28249.410 172074.881 100
[Matt Dowle 编辑] 24695 微秒 = 0.024 秒.对具有微小数据的微基准进行的推论很少能保持有意义的数据大小.
24695 microseconds = 0.024 seconds. Inferences made on microbenchmarks with tiny data rarely hold on meaningful data sizes.
[由 flodel 编辑] 我的向量长度为 2500,考虑到 Tyler 的声明 (10k),这是相当有意义的,但是很好,让我们尝试使用长度为 2.5e7 的向量.鉴于情况,我希望你能原谅我使用 system.time
:
My vectors had length 2500 which was rather meaningful given Tyler's statement (10k), but fine, let's try with vectors of length 2.5e7. I hope you will forgive me for using system.time
given the circumstances:
X <- rep(x, 1e6)
Y <- rep(y, 1e6)
system.time(flodel(X,Y))
# user system elapsed
# 0.694 0.205 0.899
system.time(GG1(X,Y))
# user system elapsed
# 31.250 16.496 112.967
system.time(GG2(X,Y))
# Error in `[.data.table`(dty, dtx, abs(x - y), roll = "nearest") :
# negative length vectors are not allowed
<小时>
[来自 Arun 的编辑] - 使用 1.8.11 的 2.5e7 基准测试:
[来自 Arun 的编辑 2] - 在 Matt 最近更快二分搜索/合并
require(data.table)
arun <- function(x, y) {
dtx <- data.table(x=which(x))
setattr(dtx, 'sorted', 'x')
dty <- data.table(y=which(y))
setattr(dty, 'sorted', 'y')
dty[, y1 := y]
dty[dtx, roll="nearest"][, abs(y-y1)]
}
# minimum of three consecutive runs
system.time(ans1 <- arun(X,Y))
# user system elapsed
# 1.036 0.138 1.192
# minimum of three consecutive runs
system.time(ans2 <- flodel(X,Y))
# user system elapsed
# 0.983 0.197 1.221
identical(ans1, ans2) # [1] TRUE
这篇关于2个逻辑向量的元素之间的快速最小距离(间隔)(取2)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!