迭代 numpy 数组列的所有成对组合 [英] Iterate over all pairwise combinations of numpy array columns

查看:34
本文介绍了迭代 numpy 数组列的所有成对组合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大小为

arr.size = (200, 600, 20). 

我想在最后两个维度的每个成对组合上计算 scipy.stats.kendalltau.例如:

I want to compute scipy.stats.kendalltau on every pairwise combination of the last two dimensions. For example:

kendalltau(arr[:, 0, 0], arr[:, 1, 0])
kendalltau(arr[:, 0, 0], arr[:, 1, 1])
kendalltau(arr[:, 0, 0], arr[:, 1, 2])
...
kendalltau(arr[:, 0, 0], arr[:, 2, 0])
kendalltau(arr[:, 0, 0], arr[:, 2, 1])
kendalltau(arr[:, 0, 0], arr[:, 2, 2])
...
...
kendalltau(arr[:, 598, 20], arr[:, 599, 20])

这样我就覆盖了 arr[:, i, xi]arr[:, j, xj]i i <;jxi in [0,20)xj in [0, 20).这是 (600 choose 2) * 400 个单独的计算,但由于每个计算在我的机器上大约需要 0.002 s,因此多处理不会超过一天模块.

such that I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0,20), xj in [0, 20). This is (600 choose 2) * 400 individual calculations, but since each takes about 0.002 s on my machine, it shouldn't take much longer than a day with the multiprocessing module.

迭代这些列的最佳方法是什么(使用 i)?我想我应该避免类似

What's the best way to go about iterating over these columns (with i<j)? I figure I should avoid something like

for i in range(600):
    for j in range(i+1, 600):
        for xi in range(20):
            for xj in range(20):

最简单的方法是什么?

我更改了标题,因为 Kendall Tau 对这个问题并不重要.我意识到我也可以做类似的事情

I changed the title since Kendall Tau isn't really important to the question. I realize I could also do something like

import itertools as it
for i, j in it.combinations(xrange(600), 2):
    for xi, xj in product(xrange(20), xrange(20)):

但是必须有一种更好、更矢量化的 numpy 方式.

but there's got to be a better, more vectorized way with numpy.

推荐答案

矢量化此类内容的一般方法是使用广播来创建集合与自身的笛卡尔积.在您的情况下,您有一个形状为 (200, 600, 20) 的数组 arr,因此您可以对它进行两个视图:

The general way of vectorizing something like this is to use broadcasting to create the cartesian product of the set with itself. In your case you have an array arr of shape (200, 600, 20), so you would take two views of it:

arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20)
arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)

为了清楚起见,以上两行已被扩展,但我通常会写成等价的:

The above two lines have been expanded for clarity, but I would normally write the equivalent:

arr_x = arr[:, :, None, None]
arr_y = arr

如果您有一个矢量化函数 f,它在除最后一个维度之外的所有维度上都进行了广播,那么您可以这样做:

If you have a vectorized function, f, that did broadcasting on all but the last dimension, you could then do:

out = f(arr[:, :, None, None], arr)

然后 out 将是一个形状为 (200, 600, 200, 600) 的数组,其中 out[i, j, k, l] 保存 f(arr[i, j], arr[k, l]) 的值.例如,如果你想计算所有的成对内积,你可以这样做:

And then out would be an array of shape (200, 600, 200, 600), with out[i, j, k, l] holding the value of f(arr[i, j], arr[k, l]). For instance, if you wanted to compute all the pairwise inner products, you could do:

from numpy.core.umath_tests import inner1d

out = inner1d(arr[:, :, None, None], arr)

不幸的是 scipy.stats.kendalltau 没有像这样矢量化.根据文档

Unfortunately scipy.stats.kendalltau is not vectorized like this. According to the docs

如果数组不是一维的,它们将被展平为一维."

"If arrays are not 1-D, they will be flattened to 1-D."

所以你不能这样去做,你最终会做 Python 嵌套循环,无论是明确地写出它们,使用 itertools 或在 np.vectorize.这会很慢,因为要对 Python 变量进行迭代,并且因为每个迭代步骤都有一个 Python 函数,这都是昂贵的操作.

So you cannot go about it like this, and you are going to wind up doing Python nested loops, be it explicitly writing them out, using itertools or disguising it under np.vectorize. That's going to be slow, because of the iteration on Python variables, and because you have a Python function per iteration step, which are both expensive actions.

请注意,当您可以采用矢量化方式时,有一个明显的缺点:如果您的函数是可交换的,即如果 f(a, b) == f(b, a),那么您需要进行两倍的计算.根据您的实际计算成本,这通常会被没有任何 Python 循环或函数调用所带来的速度提升所抵消.

Do note that, when you can go the vectorized way, there is an obvious drawback: if your function is commutative, i.e. if f(a, b) == f(b, a), then you are doing twice the computations needed. Depending on how expensive your actual computation is, this is very often offset by the increase in speed from not having any Python loops or function calls.

这篇关于迭代 numpy 数组列的所有成对组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆