NumPy arrays: row/column-wise argmax with random ties


Question

Here is what I am trying to do with NumPy in Python 2.7. Suppose I have an array a defined as follows:

a = np.array([[1,3,3],[4,5,6],[7,8,1]])

I can call a.argmax(0) (one result per column) or a.argmax(1) (one result per row) to get the column-wise or row-wise argmax:

a.argmax(0)
Out[329]: array([2, 2, 1], dtype=int64)
a.argmax(1)
Out[330]: array([1, 2, 1], dtype=int64)

However, when there is a tie, as in the first row of a, I would like the argmax to be chosen at random among the tied positions (by default, NumPy returns the first occurrence whenever a tie occurs in argmax or argmin).
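The default tie-breaking is easy to confirm. A minimal sketch using the array from the question (the variable names tied_cols and first_hit are only illustrative):

```python
import numpy as np

a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])

# Row 0 has its maximum (3) at columns 1 and 2
tied_cols = np.flatnonzero(a[0] == a[0].max())

# argmax always reports the first of the tied positions
first_hit = a.argmax(1)[0]
print(tied_cols, first_hit)
```

Column 2 is never returned for row 0, no matter how often this is run.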

Last year, someone asked a question about breaking NumPy argmax/argmin ties randomly: Select One Element in Each Row of a Numpy Array by Column Indices

However, that question targeted one-dimensional arrays, and the most-voted answer works well for that case. A second answer attempts to handle multidimensional arrays as well, but it does not work: it does not return, for each row/column, the index of the maximum value with ties broken randomly.

What would be the most performant way to do this, given that I am working with large arrays?

Recommended answer

Generic case solution to pick one per group

To solve the general case of picking one random number per group, where a list/array of numbers specifies the range for each pick, we can create an array of uniform random numbers, add a per-group offset (so that the values of group i fall in [i, i+1)), and then perform argsort. The implementation would look something like this -

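import numpy as np

def random_num_per_grp(L):
    # For each group size in L, pick a random integer in [0, size)
    # Uniform noise in [0, 1) plus the group id keeps each group contiguous after sorting
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    # Start position of each group in the flattened array
    offset = np.r_[0,np.cumsum(L[:-1])]
    # The first sorted entry of each group is a random member; subtract offset for a local index
    return r1.argsort()[offset] - offset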

Sample run -

In [217]: L = [5,4,2]

In [218]: random_num_per_grp(L) # i.e. select one each from [0,5), [0,4), [0,2)
Out[218]: array([2, 0, 1])

So, the output has the same number of elements as the input L: the first output element lies in [0,5), the second in [0,4), and so on.
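The intermediate arrays make the trick easier to follow. The sketch below unrolls the function step by step. It assumes NumPy 1.17 or later for np.random.default_rng (the answer itself uses np.random.rand), and the names group_id, keys, order and picks are introduced here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: NumPy >= 1.17; the answer uses np.random.rand

L = np.array([5, 4, 2])

# Group label for every slot: [0 0 0 0 0 1 1 1 1 2 2]
group_id = np.repeat(np.arange(len(L)), L)

# Uniform noise in [0, 1) plus the label: keys of group g lie in [g, g+1)
keys = rng.random(L.sum()) + group_id

# Sorting keeps groups contiguous and in order, but shuffles the slots inside each group
order = keys.argsort()

# Start of each group in the flattened layout: [0 5 9]
offset = np.r_[0, np.cumsum(L[:-1])]

# First sorted slot of each group, converted from a global to a local index
picks = order[offset] - offset
print(picks)
```

Because the noise never reaches 1, a key from group g can never sort past a key from group g+1, which is what keeps the groups separate after argsort.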

To solve our case here, we would use a modified version (specifically, one that drops the offset subtraction at the end of the function), like so -

def random_num_per_grp_cumsumed(L):
    # For each group size in L, pick one random position inside that group
    # The output keeps the cumulative group offsets, so it can index a flattened array directly
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset]

Approach #1

One solution could use it like so -

def argmax_per_row_randtie(a):
    max_mask = a==a.max(1,keepdims=1)
    m,n = a.shape
    all_argmax_idx = np.flatnonzero(max_mask)
    offset = np.arange(m)*n
    return all_argmax_idx[random_num_per_grp_cumsumed(max_mask.sum(1))] - offset

Verification

Let's test this on the given sample with a large number of runs and count the number of occurrences of each index for each row

In [235]: a
Out[235]: 
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])

In [225]: all_out = np.array([argmax_per_row_randtie(a) for i in range(10000)])

# The first element (row=0) should have similar probabilities for 1 and 2
In [236]: (all_out[:,0]==1).mean()
Out[236]: 0.504

In [237]: (all_out[:,0]==2).mean()
Out[237]: 0.496

# The second element (row=1) should only have 2
In [238]: (all_out[:,1]==2).mean()
Out[238]: 1.0

# The third element (row=2) should only have 1
In [239]: (all_out[:,2]==1).mean()
Out[239]: 1.0
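The same check can be extended to a tie among more than two positions. The sketch below repeats the two functions so that it runs on its own. The array b and the run count are hypothetical choices made for this illustration, not part of the original answer:

```python
import numpy as np

def random_num_per_grp_cumsumed(L):
    # Same helper as above, repeated so this snippet is self-contained
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)), L)
    offset = np.r_[0, np.cumsum(L[:-1])]
    return r1.argsort()[offset]

def argmax_per_row_randtie(a):
    max_mask = a == a.max(1, keepdims=True)
    m, n = a.shape
    all_argmax_idx = np.flatnonzero(max_mask)
    offset = np.arange(m) * n
    return all_argmax_idx[random_num_per_grp_cumsumed(max_mask.sum(1))] - offset

# Hypothetical test array: row 0 has a three-way tie at columns 0, 2 and 3
b = np.array([[9, 1, 9, 9],
              [0, 2, 1, 2]])

runs = 6000
picks = np.array([argmax_per_row_randtie(b)[0] for _ in range(runs)])

# Relative frequency of each column index chosen for row 0
freq = np.bincount(picks, minlength=b.shape[1]) / runs
print(freq)  # column 1 should be exactly 0; columns 0, 2 and 3 close to 1/3 each
```

np.bincount gives the whole distribution in one call, which scales better than one comparison per index when rows have many tied positions.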

Approach #2 : Use masking for performance

We could use masking and thereby avoid flatnonzero, aiming for better performance, since working with boolean arrays is generally efficient. We also generalize to cover both rows (axis=1) and columns (axis=0), giving a modified version like so -

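def argmax_randtie_masking_generic(a, axis=1):
    # True wherever an element equals the maximum along the chosen axis
    max_mask = a==a.max(axis=axis,keepdims=True)
    # Number of tied maxima per row (axis=1) or per column (axis=0)
    L = max_mask.sum(axis=axis)
    # One flag per tied maximum; exactly one flag per group is set to True
    set_mask = np.zeros(L.sum(), dtype=bool)
    select_idx = random_num_per_grp_cumsumed(L)
    set_mask[select_idx] = True
    # Write the flags back so only the selected maximum stays True in each group
    if axis==0:
        max_mask.T[max_mask.T] = set_mask
    else:
        max_mask[max_mask] = set_mask
    return max_mask.argmax(axis=axis)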

Sample runs with axis=0 and axis=1 -

In [423]: a
Out[423]: 
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])
In [424]: argmax_randtie_masking_generic(a, axis=1)
Out[424]: array([1, 2, 1])

In [425]: argmax_randtie_masking_generic(a, axis=1)
Out[425]: array([2, 2, 1])

In [426]: a[1,1] = 8

In [427]: a
Out[427]: 
array([[1, 3, 3],
       [4, 8, 6],
       [7, 8, 1]])

In [428]: argmax_randtie_masking_generic(a, axis=0)
Out[428]: array([2, 1, 1])

In [429]: argmax_randtie_masking_generic(a, axis=0)
Out[429]: array([2, 1, 1])

In [430]: argmax_randtie_masking_generic(a, axis=0)
Out[430]: array([2, 2, 1])
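Since the outputs are random, a property check is a handy complement to eyeballing sample runs: whatever index is returned must point at a maximum along the chosen axis. The sketch below is one way to write such a check. The helper is_valid_argmax is a hypothetical name, np.take_along_axis assumes NumPy 1.15 or later, and argmax_randtie_keys is a simple stand-in based on random keys (not the answer's method), included only so the snippet runs on its own:

```python
import numpy as np

def is_valid_argmax(a, idx, axis):
    # True if every returned index points at a maximum along `axis`
    picked = np.take_along_axis(a, np.expand_dims(idx, axis), axis=axis)
    return bool(np.all(picked == a.max(axis=axis, keepdims=True)))

def argmax_randtie_keys(a, axis=1, rng=None):
    # Stand-in, not the answer's method: give each tied maximum a random key
    # in [0, 1), every other position -1, and take the argmax of the keys
    rng = np.random.default_rng() if rng is None else rng
    max_mask = a == a.max(axis=axis, keepdims=True)
    keys = np.where(max_mask, rng.random(a.shape), -1.0)
    return keys.argmax(axis=axis)

a = np.array([[1, 3, 3], [4, 8, 6], [7, 8, 1]])
for axis in (0, 1):
    idx = argmax_randtie_keys(a, axis=axis)
    print(axis, idx, is_valid_argmax(a, idx, axis))
```

To check the masking approach instead, pass the result of argmax_randtie_masking_generic(a, axis=axis) to is_valid_argmax in place of the stand-in.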
