All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches
Question
What I want to do is more or less a combination of the problems discussed in the two following threads:
I have two numeric vectors:
b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)
I want to compare all elements in b_1 against all elements in b_2, and vice versa.
If element_i in b_1 does NOT fall within the range element_j ± 0.045 for any element_j in b_2, then element_i must be reported.
Likewise, if element_j in b_2 does NOT fall within the range element_i ± 0.045 for any element_i in b_1, then element_j must be reported.
Therefore, the expected answer for the vectors above is:
### based on threshold = 0.045
in_b1_not_in_b2 <- c(1002.5698, 301.569)
in_b2_not_in_b1 <- c(22.12, 53, 5666.31, 100.1)
Is there an R function that would do this?
Recommended answer
If you are happy to use a non-base package, data.table::inrange is a convenient function.
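library(data.table)

# elements of b_1 that fall in none of the intervals b_2 ± 0.045
b_1[!inrange(b_1, b_2 - 0.045, b_2 + 0.045)]
# [1] 1002.570  301.569
# elements of b_2 that fall in none of the intervals b_1 ± 0.045
b_2[!inrange(b_2, b_1 - 0.045, b_1 + 0.045)]
# [1]   22.12   53.00 5666.31  100.10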
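For readers who prefer to stay in base R, the same rule can be sketched with outer(). This is only an illustration of the thresholded set difference described in the question, not part of the original answer; the helper name setdiff_tol is hypothetical, and the approach builds a full length(x) by length(y) matrix, so it is only practical for small vectors.

```r
# Hypothetical helper (not from the original answer): return the elements
# of x that have no element of y within +/- tol.
setdiff_tol <- function(x, y, tol) {
  # length(x)-by-length(y) matrix of absolute pairwise differences
  d <- abs(outer(x, y, "-"))
  # keep x[i] when no entry in row i is within the tolerance
  x[rowSums(d <= tol) == 0]
}

b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

in_b1_not_in_b2 <- setdiff_tol(b_1, b_2, 0.045)  # 1002.5698, 301.569
in_b2_not_in_b1 <- setdiff_tol(b_2, b_1, 0.045)  # 22.12, 53, 5666.31, 100.1
```

Both calls reproduce the expected answer given in the question.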
inrange is also efficient on larger data sets. For example, on vectors of length 1e5, inrange is more than 700 times faster than the two other alternatives:
library(data.table)
library(microbenchmark)

# f() is not defined in this answer; it is presumably the function from one of the other answers
n <- 1e5
b1 <- runif(n, 0, 10000)
b2 <- b1 + runif(n, -1, 1)
microbenchmark(
  f1 = f(b1, b2, 0.045, 5000),
  f2 = list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
            in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]),
  f3 = list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
            in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]),
  unit = "relative", times = 10)
# Unit: relative
#  expr      min       lq     mean   median        uq       max neval
#    f1 1976.931 1481.324 1269.393 1103.567 1173.3017 1060.2435    10
#    f2 1347.114 1027.682  858.908  766.773  754.7606  700.0702    10
#    f3    1.000    1.000    1.000    1.000    1.0000    1.0000    10
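The sapply alternative compares every pair of elements, which is what makes it slow on long vectors. As a rough illustration of how a sort-based lookup avoids that, here is a base R sketch using findInterval(). It is an assumption-laden example, not a description of how data.table implements inrange, and the helper name setdiff_tol_sorted is hypothetical. It assumes the inputs contain no NA values, and no timings are claimed for it.

```r
# Hypothetical helper (not from the original answer): same result as the
# pairwise check, but each x is compared only with its nearest neighbours
# in the sorted y.
setdiff_tol_sorted <- function(x, y, tol) {
  if (length(y) == 0L) return(x)
  ys <- sort(y)
  # lo[i] is the number of ys values <= x[i] (0 if x[i] is below all of them)
  lo <- findInterval(x, ys)
  # nearest neighbour at or below x, and nearest neighbour above x;
  # -Inf / Inf stand in when there is no such neighbour
  below <- c(-Inf, ys)[lo + 1L]
  above <- c(ys, Inf)[lo + 1L]
  matched <- (x - below <= tol) | (above - x <= tol)
  x[!matched]
}

b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)

setdiff_tol_sorted(b_1, b_2, 0.045)  # 1002.5698, 301.569
setdiff_tol_sorted(b_2, b_1, 0.045)  # 22.12, 53, 5666.31, 100.1
```

Because only the two neighbours of each element are inspected after sorting, the work should grow roughly as n log n rather than n squared.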
And yes, they give the same result:
n <- 100
b1 <- runif(n, 0, 10000)
b2 <- b1 + runif(n, -1, 1)
all.equal(f(b1, b2, 0.045, 5000),
          list(in_b1_not_in_b2 = b1[sapply(b1, function(x) !any(abs(x - b2) <= 0.045))],
               in_b2_not_in_b1 = b2[sapply(b2, function(x) !any(abs(x - b1) <= 0.045))]))
# TRUE
all.equal(f(b1, b2, 0.045, 5000),
          list(in_b1_not_in_b2 = b1[!inrange(b1, b2 - 0.045, b2 + 0.045)],
               in_b2_not_in_b1 = b2[!inrange(b2, b1 - 0.045, b1 + 0.045)]))
# TRUE
Searching for inrange on SO turns up several related, potentially useful answers.