Matching two very very large vectors with tolerance (fast! but working space sparing)

Published: 2020/5/6 9:35:38. Tags: r vector matching

Problem description

Consider two vectors. One is a reference vector/list that contains all values of interest; the other is a sample vector that could contain any possible value. I want to find matches of my sample inside the reference list within a tolerance that is not fixed but depends on the values being compared:

matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

Rounding both vectors is not an option!

For example, consider:

referencelist <- read.table(header=TRUE, text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

so that I get the result:

    name value      reference
1    A   154.00315  154.00312
2    G   159.02991  159.02992
3    B   154.07688  154.07685
4    E   156.77312  156.77310

What I can do is use, for example, the outer() function:

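# Full pairwise matrix of relative deviations in ppm (rows: reference, columns: sample)
myDist <- outer(referencelist$value, sample, FUN = function(x, y) abs(((x - y) / y) * 10^6))
matches <- which(myDist < 0.5, arr.ind = TRUE)
data.frame(name = referencelist$name[matches[, 1]], value = sample[matches[, 2]])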

or I can use a for() loop.

But my particular problem is that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7. So with outer() I easily exceed all working space limits, and a for() loop or nested for() loops would take days or weeks to finish.

Does anybody have an idea of how to do this fast in R, still precise, but on a computer with at most 64 GB RAM?

Thanks for any help!

Best wishes

Recommended answer

Your matching condition

abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

can be rewritten as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6 (this holds for positive sample[i], as in the example).
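As a quick sanity check of this rewrite, the snippet below compares the relative-deviation form and the interval form on the example values from the question. It is only a sketch: `ref` is simply the value column of the reference list typed in by hand, and `by_ratio`, `by_interval`, and `same` are names made up for this illustration.

```r
# Example values copied from the question
ref <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
         156.83991, 159.02992, 159.65553, 159.93843)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps <- 0.5e-6

# For every sample value, test whether both forms flag the same reference entries
same <- vapply(sample, function(s) {
  by_ratio    <- abs(((ref - s) / s) * 10^6) < 0.5
  by_interval <- ref > s * (1 - eps) & ref < s * (1 + eps)
  identical(by_ratio, by_interval)
}, logical(1))

all(same)
```

Values that sit almost exactly on the tolerance boundary could in principle be classified differently by the two forms because of floating-point rounding; the example values are far from that boundary.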

Using this, we can use a non-equi join to find all matches (not only the nearest!) in referencelist for each sample:

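library(data.table)
options(digits = 10)
eps <- 0.5E-6 # 0.5 ppm, i.e. 0.5 / 10^6
# The join below expects the reference column to be named ref,
# so convert to a data.table and rename value first
setDT(referencelist)
setnames(referencelist, "value", "ref")
referencelist[.(value = sample, 
                lower = sample * (1 - eps), 
                upper = sample * (1 + eps)), 
              on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]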

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310
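A note on working space: 1*10^12 doubles at 8 bytes each come to roughly 8 TB, so the full reference list cannot be held in 64 GB of RAM at once. One possible way around this, which is my own assumption and not something tested at that scale, is to run the same non-equi join over one chunk of the reference list at a time and bind the partial results. In the sketch below `match_chunk()` is a hypothetical helper, and the chunks are cut from the small in-memory example as a stand-in for reading pieces of the reference list from disk.

```r
library(data.table)

# Hypothetical helper: join one chunk of the reference list against all samples.
# nomatch = NULL drops samples that have no match in this particular chunk.
match_chunk <- function(ref_chunk, sample, eps = 0.5e-6) {
  windows <- data.table(value = sample,
                        lower = sample * (1 - eps),
                        upper = sample * (1 + eps))
  ref_chunk[windows, on = .(ref > lower, ref < upper),
            .(name, value, reference = x.ref), nomatch = NULL]
}

referencelist <- data.table(
  ref  = c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
           156.83991, 159.02992, 159.65553, 159.93843),
  name = LETTERS[1:9])
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

# Stand-in for chunked reading: process the reference list 4 rows at a time
chunk_size <- 4
starts <- seq(1, nrow(referencelist), by = chunk_size)
result <- rbindlist(lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(referencelist))
  match_chunk(referencelist[rows], sample)
}))
result
```

The rows come back grouped by chunk rather than in sample order, so sort the combined result afterwards if the order matters.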

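In response to OP's comment: let's say we have a modified referencelist2 with F = 154.00320; then this will be caught too:

# Modified copy of the reference list in which F is 154.00320
referencelist2 <- setDT(copy(referencelist))
setnames(referencelist2, c("ref", "name"))  # the join column must be named ref
referencelist2[name == "F", ref := 154.00320]
referencelist2[.(value = sample, 
                 lower = sample * (1 - eps), 
                 upper = sample * (1 + eps)), 
               on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]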

   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310
