Matching two very very large vectors with tolerance (fast! but working space sparing)

Problem description

Consider that I have two vectors. One is a reference vector/list that includes all values of interest, and one is a sample vector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance, which is not fixed but depends on the values being compared:

matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

Rounding both vectors is not an option!

For example, consider:

referencelist <- read.table(header=TRUE, text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

so that I get the result:

    name value      reference
1    A   154.00315  154.00312
2    G   159.02991  159.02992
3    B   154.07688  154.07685
4    E   156.77312  156.77310

What I can do is use, e.g., the outer() function:

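# relative deviation in ppm between every reference value (rows) and every sample value (columns)
myDist <- outer(referencelist$value, sample, FUN=function(x, y) abs(((x - y)/y)*10^6))
# row index = position in referencelist, column index = position in sample
matches <- which(myDist < 0.5, arr.ind=TRUE)
data.frame(name = referencelist$name[matches[, 1]], value=sample[matches[, 2]])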

or I could use a for() loop.

But my particular problem is that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7. So with outer() I easily exceed all working-space limits, and with a for() loop or nested for() loops this would take days or weeks to finish.
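To put rough numbers on this, here is a back-of-the-envelope estimate of the memory the full outer() matrix would need. It assumes 8 bytes per value, which is what R uses for doubles; the variable names are only illustrative.

```r
n_ref    <- 1e12   # reference entries, as stated above
n_sample <- 1e7    # sample entries, as stated above
bytes_per_double <- 8

matrix_bytes <- n_ref * n_sample * bytes_per_double  # full distance matrix
ref_bytes    <- n_ref * bytes_per_double             # reference vector alone

matrix_bytes / 1e9   # size of the distance matrix in GB
ref_bytes / 1e9      # size of the reference vector in GB
```

Under these assumptions even the reference vector alone is about 8,000 GB, so it cannot be held in 64 GB of RAM in one piece either.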

Does anybody have an idea of how to do this fast in R, still precisely, but on a computer with at most 64 GB of RAM?

Thanks for any help!

Best wishes

Recommended answer

Your match condition

abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

can be rewritten as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6, provided the sample values are positive.
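As a quick sanity check of this rewrite, the following sketch evaluates both formulations on the example data from the question and compares the results. The names ref, ppm_match and interval_match are only illustrative, and the equivalence relies on all sample values being positive.

```r
# Example data from the question
ref    <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
            156.83991, 159.02992, 159.65553, 159.93843)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps    <- 0.5e-6

# Original formulation: relative deviation in ppm below 0.5
ppm_match <- outer(ref, sample, function(x, y) abs((x - y) / y * 1e6) < 0.5)

# Rewritten formulation: reference value inside an open interval around each sample
interval_match <- outer(ref, sample,
                        function(x, y) x > y * (1 - eps) & x < y * (1 + eps))

identical(ppm_match, interval_match)   # do both formulations agree?
which(interval_match, arr.ind = TRUE)  # row = reference index, col = sample index
```

The full outer() matrix is used here only because the example is tiny; it is exactly what has to be avoided for the real data.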

Using this, we can use a non-equi join to find all matches (not only the nearest!) in referencelist for each sample:

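library(data.table)
options(digits = 10)
eps <- 0.5E-6 # relative tolerance: 0.5 ppm = 0.5 / 1E6
# the join below expects the reference values in a column named "ref"
setnames(setDT(referencelist), "value", "ref")
# x.ref is the matched reference value; value is the sample value
referencelist[.(value = sample, 
                lower = sample * (1 - eps), 
                upper = sample * (1 + eps)), 
              on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]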

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310

In response to the OP's comment: let's say we have a modified referencelist2 with F = 154.00320; then this will be caught too:

setDT(referencelist2)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310
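The same interval condition can also be sketched in base R with a sorted reference vector and findInterval(), which avoids building any reference-by-sample matrix. This is an alternative illustration, not part of the answer above; match_tolerance is a hypothetical helper name, and the sketch assumes positive sample values and a reference vector (or chunk of it) that fits in memory.

```r
# Hypothetical helper: for every sample value, find all positions in a
# *sorted* reference vector that lie strictly inside the tolerance interval.
match_tolerance <- function(ref_sorted, sample, eps = 0.5e-6) {
  lower <- sample * (1 - eps)
  upper <- sample * (1 + eps)
  # count of reference values <= lower, so the first candidate is one further
  first <- findInterval(lower, ref_sorted) + 1L
  # count of reference values < upper (left.open = TRUE gives strict "<")
  last  <- findInterval(upper, ref_sorted, left.open = TRUE)
  n     <- pmax(last - first + 1L, 0L)         # matches per sample value
  sample_idx <- rep(seq_along(sample), n)      # repeat each sample per match
  ref_idx    <- first[sample_idx] + sequence(n) - 1L
  data.frame(sample_idx = sample_idx, ref_idx = ref_idx)
}

# Example data from the question
referencelist <- data.frame(
  value = c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
            156.83991, 159.02992, 159.65553, 159.93843),
  name  = LETTERS[1:9]
)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

ref_sorted <- referencelist[order(referencelist$value), ]
hits <- match_tolerance(ref_sorted$value, sample)
result <- data.frame(name      = ref_sorted$name[hits$ref_idx],
                     value     = sample[hits$sample_idx],
                     reference = ref_sorted$value[hits$ref_idx])
result
```

Because each lookup is a binary search, the work grows with the number of sample values times the logarithm of the reference length. For a reference of the size described in the question, the helper would presumably have to be applied to one sorted chunk of the reference at a time, with the results combined afterwards.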
