遍历numpy数组列的所有成对组合 [英] Iterate over all pairwise combinations of numpy array columns

查看:77
本文介绍了遍历numpy数组列的所有成对组合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大小为numpy的数组

I have an numpy array of size

arr.size = (200, 600, 20). 

我想在最后两个维度的每个成对组合上计算 scipy.stats.kendalltau .例如:

I want to compute scipy.stats.kendalltau on every pairwise combination of the last two dimensions. For example:

kendalltau(arr[:, 0, 0], arr[:, 1, 0])
kendalltau(arr[:, 0, 0], arr[:, 1, 1])
kendalltau(arr[:, 0, 0], arr[:, 1, 2])
...
kendalltau(arr[:, 0, 0], arr[:, 2, 0])
kendalltau(arr[:, 0, 0], arr[:, 2, 1])
kendalltau(arr[:, 0, 0], arr[:, 2, 2])
...
...
kendalltau(arr[:, 598, 20], arr[:, 599, 20])

这样我就覆盖了 arr [:, i,xi] arr [:, j,xj] i<[0,20)中的j 和 xi,[0,20)中的 xj.这是(600选择2)* 400 个单独的计算,但是由于每个计算在我的计算机上大约需要 0.002 s ,因此多处理时间不会超过一天模块.

such that I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0,20), xj in [0, 20). This is (600 choose 2) * 400 individual calculations, but since each takes about 0.002 s on my machine, it shouldn't take much longer than a day with the multiprocessing module.

迭代这些列(使用 i< j )的最佳方法是什么?我认为我应该避免类似的事情

What's the best way to go about iterating over these columns (with i<j)? I figure I should avoid something like

for i in range(600):
    for j in range(i+1, 600):
        for xi in range(20):
            for xj in range(20):

最numpythonic的方法是什么?

What is the most numpythonic way of doing this?

编辑:我更改了标题,因为Kendall Tau对这个问题并不是很重要.我意识到我也可以做类似的事情

I changed the title since Kendall Tau isn't really important to the question. I realize I could also do something like

import itertools as it
for i, j in it.combinations(xrange(600), 2):
    for xi, xj in product(xrange(20), xrange(20)):

但是numpy必须有一个更好的,更矢量化的方式.

but there's got to be a better, more vectorized way with numpy.

推荐答案

矢量化矢量的一般方法是使用广播来自身创建集合的笛卡尔积.在您的情况下,您有一个形状为(200,600,20)的数组 arr ,因此您将获得两个视图:

The general way of vectorizing something like this is to use broadcasting to create the cartesian product of the set with itself. In your case you have an array arr of shape (200, 600, 20), so you would take two views of it:

arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20)
arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)

为清楚起见,对以上两行进行了扩展,但我通常会写等效的内容:

The above two lines have been expanded for clarity, but I would normally write the equivalent:

arr_x = arr[:, :, None, None]
arr_y = arr

如果您有一个矢量化的函数 f ,除了最后一个维度,它都在所有维度上进行了广播,那么您可以执行以下操作:

If you have a vectorized function, f, that did broadcasting on all but the last dimension, you could then do:

out = f(arr[:, :, None, None], arr)

然后 out 将是形状为(200,600,200,600)的数组,其中 out [i,j,k,l]保留 f(arr [i,j],arr [k,l])的值.例如,如果要计算所有成对的内积,则可以执行以下操作:

And then out would be an array of shape (200, 600, 200, 600), with out[i, j, k, l] holding the value of f(arr[i, j], arr[k, l]). For instance, if you wanted to compute all the pairwise inner products, you could do:

from numpy.core.umath_tests import inner1d

out = inner1d(arr[:, :, None, None], arr)

不幸的是, scipy.stats.kendalltau 并非像这样矢量化.根据文档

Unfortunately scipy.stats.kendalltau is not vectorized like this. According to the docs

如果数组不是一维数组,它们将被展平为一维数组."

"If arrays are not 1-D, they will be flattened to 1-D."

所以您不能像这样处理它,并且您将结束Python嵌套循环,无论是使用 itertools 明确地写出它们还是在

So you cannot go about it like this, and you are going to wind up doing Python nested loops, be it explicitly writing them out, using itertools or disguising it under np.vectorize. That's going to be slow, because of the iteration on Python variables, and because you have a Python function per iteration step, which are both expensive actions.

请注意,当您可以采用向量化方式时,存在一个明显的缺点:如果您的函数是可交换的,即 f(a,b)== f(b,a),那么您需要进行两倍的计算.根据实际计算的昂贵程度,通常可以通过不进行任何Python循环或调用函数来提高速度来弥补这一点.

Do note that, when you can go the vectorized way, there is an obvious drawback: if your function is commutative, i.e. if f(a, b) == f(b, a), then you are doing twice the computations needed. Depending on how expensive your actual computation is, this is very often offset by the increase in speed from not having any Python loops or function calls.

这篇关于遍历numpy数组列的所有成对组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆