Compute the pairwise distance in scipy with missing values


Problem description

This question concerns how scipy.spatial.distance.pdist handles missing (nan) values.

In case I mixed up the dimensions of my matrix, let's get that out of the way first. From the docs:

The points are arranged as m n-dimensional row vectors in the matrix X.

So let's generate three points in 10-dimensional space with missing values:

import numpy
from scipy.spatial.distance import pdist

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan  # treat every value below 1.0 as missing

If I compute the Euclidean distance of these three observations:

pdist(data, "euclidean")

I get:

array([ nan,  nan,  nan])

However, if I filter out all the columns with missing values, I do get proper distance values:

valid = [i for (i, col) in enumerate(data.T) if ~numpy.isnan(col).any()]
pdist(data[:, valid], "euclidean")

I get:

array([ 3.35518662,  2.35481185,  3.10323893])

This way, I throw away more data than I'd like, since I don't need to filter the whole matrix but only the pair of vectors being compared at a time. Can I make pdist or a similar function perform pairwise masking somehow?

Since my full matrix is rather large, I did some timing tests on the small data set provided here.

1.) The scipy function.

%timeit pdist(data, "euclidean")

10000 loops, best of 3: 24.4 µs per loop

2.) Unfortunately, the solution provided so far is roughly 10 times slower.

%timeit numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()

1000 loops, best of 3: 231 µs per loop

3.) Then I tested "pure" Python and was pleasantly surprised:

from scipy.linalg import norm

%%timeit
m = data.shape[0]
dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
mask = numpy.isfinite(data)
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        curr = numpy.logical_and(mask[i], mask[j])
        u = data[i][curr]
        v = data[j][curr]
        dm[k] = norm(u - v)
        k += 1

10000 loops, best of 3: 98.9 µs per loop

So I think the way forward is to Cythonize the above code in a function.
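As an alternative to Cythonizing, the loop in 3.) can also be written without an explicit Python loop. The following is only a sketch under stated assumptions: the function name nan_pairwise_euclidean is hypothetical, the technique is plain NumPy (numpy.triu_indices plus numpy.nansum) rather than anything pdist offers, and no timing claim is made for it. It builds an intermediate array with one row per pair, so it may need too much memory for a very large matrix.

```python
import numpy


def nan_pairwise_euclidean(data):
    """Condensed Euclidean distances that skip dimensions where either point is nan.

    Hypothetical helper, not part of scipy.
    """
    # Row indices of all pairs with i < j, in the same order pdist uses.
    i, j = numpy.triu_indices(data.shape[0], k=1)
    # A nan in either row makes the difference nan in that dimension.
    diff = data[i] - data[j]
    # nansum treats nan as zero, which drops those dimensions from the sum.
    return numpy.sqrt(numpy.nansum(diff ** 2, axis=1))


numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan
print(nan_pairwise_euclidean(data))
```

Like the loop version, this returns 0.0 for a pair of points that share no valid dimension, which may or may not be the behaviour you want.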

Recommended answer

If I understand you correctly, you want the distance computed over all dimensions for which both vectors have valid values.

Unfortunately, pdist doesn't understand masked arrays in that sense, so I modified your semi-solution so that it doesn't discard information. It is, however, neither the most efficient nor the most readable solution:

import itertools
numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()

The outer numpy.array call and the ravel are only there to give the result the shape you would expect.

itertools.combinations produces all possible pairs of row indices of the data array.
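A small illustration of what those index pairs look like for three rows. The variable names m and pairs are chosen for this example only. The pairs come out with i < j, the same order in which pdist lays out its condensed result.

```python
import itertools

m = 3  # number of rows (points), as in the example data
pairs = list(itertools.combinations(range(m), 2))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```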

I then just slice data with each pair (it must be a list and not a tuple to slice correctly) and do the pairwise filtering of nan just as your code did.
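To see why the conversion to a list matters, here is a minimal sketch on a small made-up array (not the data from the question): indexing with a list selects whole rows, while indexing with a tuple is read as one index per axis and returns a single element.

```python
import numpy

# Made-up 3 x 4 array, used only to show the indexing difference.
data = numpy.arange(12.0).reshape(3, 4)

rows = data[[0, 2]]  # list: selects rows 0 and 2
elem = data[(0, 2)]  # tuple: same as data[0, 2], a single element

print(rows.shape)  # (2, 4)
print(elem)        # 2.0
```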
