Matching two very, very large vectors with tolerance (fast, but sparing on working memory)
Problem description
Consider that I have two vectors. One is a reference vector/list that includes all values of interest, and the other is a sample vector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance, which is not fixed but depends on the values being compared:
matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
Rounding both vectors is not an option!
For example, consider:
referencelist <- read.table(header=TRUE, text="value name
154.00312 A
154.07685 B
154.21452 C
154.49545 D
156.77310 E
156.83991 F
159.02992 G
159.65553 H
159.93843 I")
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
The result I want is:
name value reference
1 A 154.00315 154.00312
2 G 159.02991 159.02992
3 B 154.07688 154.07685
4 E 156.77312 156.77310
What I can do is use, e.g., the outer() function:
myDist <- outer(referencelist$value, sample, FUN=function(x, y) abs(((x - y)/y)*10^6))
matches <- which(myDist < 0.5, arr.ind=TRUE)
data.frame(name = referencelist$name[matches[, 1]], value=sample[matches[, 2]])
Or I could use a for() loop.
But my particular problem is that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7. So by using outer() I easily exceed all memory limits, and a for() loop or nested for() loops would take days or weeks to finish.
Does anybody have an idea of how to do this fast in R, still precisely, but on a computer with at most 64 GB of RAM?
Thanks for your help!
Best wishes
Recommended answer
Your matching condition
abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
can be rewritten (for positive sample values) as
sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)
with eps = 0.5E-6.
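As a quick sanity check of this rewrite, the sketch below evaluates both forms of the condition on the example data from the question and compares which pairs they flag. It assumes all sample values are positive (for negative values the inequalities would flip). The names ref, smp, rel_form and interval_form are introduced here for illustration only.

```r
# Example data from the question (the names `ref` and `smp` are illustrative)
ref <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
         156.83991, 159.02992, 159.65553, 159.93843)
smp <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps <- 0.5e-6  # 0.5 ppm relative tolerance

# Original form: relative deviation in ppm is below 0.5
rel_form <- outer(ref, smp, function(x, y) abs((x - y) / y * 1e6) < 0.5)

# Rewritten form: reference lies strictly inside the sample's tolerance window
interval_form <- outer(ref, smp,
                       function(x, y) x > y * (1 - eps) & x < y * (1 + eps))

# Compare the two logical matrices and list the flagged (reference, sample) pairs
identical(rel_form, interval_form)
which(interval_form, arr.ind = TRUE)
```

This still uses outer(), so it is only meant for checking the algebra on small inputs, not for the full-size data.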
Using this, we can use a non-equi join to find all matches (not only the nearest!) in referencelist for each sample:
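library(data.table)
options(digits = 10)   # print enough digits to tell the values apart
eps <- 0.5E-6          # relative tolerance: 0.5 ppm, i.e. 0.5 / 1E6
setDT(referencelist)
# the join below expects the reference column to be named ref, not value
setnames(referencelist, "value", "ref")
referencelist[.(value = sample,
lower = sample * (1 - eps),
upper = sample * (1 + eps)),
on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]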
which reproduces the expected result:
name value reference
1: A 154.00315 154.00312
2: G 159.02991 159.02992
3: B 154.07688 154.07685
4: E 156.77312 156.77310
In response to the OP's comment: let's say we have a modified referencelist2 with F = 154.00320. Then this will be caught too:
setDT(referencelist2)[.(value = sample,
lower = sample * (1 - eps),
upper = sample * (1 + eps)),
on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]
name value reference
1: A 154.00315 154.00312
2: F 154.00315 154.00320
3: G 159.02991 159.02992
4: B 154.07688 154.07685
5: E 156.77312 156.77310
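The join above assumes that referencelist fits in memory. At the sizes mentioned in the question it would not: 10^12 doubles alone take about 8 TB. One possible way to stay within a RAM budget, sketched here as an assumption and not as part of the original answer, is to run the same non-equi join against one chunk of the reference data at a time and keep only the matches. The helper match_chunk, the variables sample_values and chunk_size, and the in-memory slicing used to simulate chunks are all hypothetical; in practice each chunk would be read from disk.

```r
library(data.table)

eps <- 0.5e-6  # 0.5 ppm relative tolerance

# Hypothetical helper: non-equi join of one reference chunk against all samples.
# nomatch = NULL drops samples that have no match in this chunk.
match_chunk <- function(ref_chunk, samples, eps) {
  windows <- data.table(value = samples,
                        lower = samples * (1 - eps),
                        upper = samples * (1 + eps))
  ref_chunk[windows,
            on = .(ref > lower, ref < upper),
            .(name, value, reference = x.ref),
            nomatch = NULL]
}

# Small example data standing in for the real reference list
referencelist <- data.table(
  ref = c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
          156.83991, 159.02992, 159.65553, 159.93843),
  name = LETTERS[1:9])
sample_values <- c(154.00315, 159.02991, 154.07688, 156.77312)

# Simulate chunked input by slicing the example into pieces of 4 rows
chunk_size <- 4
starts <- seq(1, nrow(referencelist), by = chunk_size)

# Only the matches from each chunk are kept and combined
result <- rbindlist(lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(referencelist))
  match_chunk(referencelist[rows], sample_values, eps)
}))
result
```

Rows in result follow the chunk order rather than the sample order, so sort afterwards if the original sample order matters. The chunk size trades memory for the number of joins and would need tuning on real data.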