NumPy: Grouping/binning values based on associations

Problem description

Forgive me for a vague title. I honestly don't know which title will suit this question. If you have a better title, let's change it so that it will be apt for the problem at hand.

Let's say result is a 2D array and values is a 1D array. values holds some values associated with elements of result. The mapping from each element of values to a position in result is stored in x_mapping and y_mapping. A single position in result can be associated with several values. I have to find the sum of the values grouped by these associations.

An example for better clarification.

result array:

[[0, 0],
[0, 0],
[0, 0],
[0, 0]]

values array:

[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.]

Note: Here result and values have the same number of elements, but that need not be the case. There is no relation between the two sizes at all.

x_mapping and y_mapping hold the mapping from the 1D values to the 2D result. x_mapping, y_mapping and values always have the same size.

x_mapping: [0, 1, 0, 0, 0, 0, 0, 0]

y_mapping: [0, 3, 2, 2, 0, 3, 2, 1]

Here, the 1st value (values[0]) has x = 0 and y = 0 (x_mapping[0] and y_mapping[0]) and is therefore associated with result[0, 0]. If we are counting the number of associations, the element at result[0, 0] will be 2, because the 1st and 5th values are both associated with result[0, 0]. If we are taking the sum, result[0, 0] = values[0] + values[4] = 1 + 5 = 6.
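To make the association concrete, here is a minimal sketch that checks the count and the sum for position (x = 0, y = 0) using the example mappings above. The names mask, count and total are introduced only for illustration and are not part of the original question.

```python
import numpy as np

values = np.arange(1.0, 9.0)                      # [1., 2., ..., 8.]
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])

# Select the entries of values that map to x == 0 and y == 0
mask = (x_mapping == 0) & (y_mapping == 0)

count = int(mask.sum())             # number of associations
total = float(values[mask].sum())   # sum of the associated values

print(count, total)  # 2 6.0
```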

# Initialisation. No connection with the solution.
import numpy as np

result = np.zeros([4, 2], dtype=np.int16)

values = np.linspace(start=1, stop=8, num=8)
# Random (row, column) targets, bounded by the shape of result.
# (The outputs shown below correspond to the example mappings listed above.)
y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    result[-y, x] = result[-y, x] + values[i]

result

[[6, 0],
[ 6, 2],
[14, 0],
[ 8, 0]]

Failed solution; but why?

test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution

To my surprise, elements are overwritten in test_result. Values in test_result:

[[5, 0],
[6, 2],
[7, 0],
[8, 0]]

Question

1. Why, in the second solution, is every element overwritten?

As @Divakar pointed out in a comment on his answer, NumPy does not accumulate or sum values when indices are repeated in test_result[-y_mapping, x_mapping] = .... Only one of the values written to a repeated position is kept, and which one is not guaranteed.
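The overwrite can be reproduced on a tiny 1D case. The following is a sketch under my own assumptions (the arrays a, b, idx and vals are made up for illustration): the right-hand side is computed into a temporary first, and the assignment then writes that temporary back, so a repeated index is simply written more than once.

```python
import numpy as np

idx = np.array([0, 0, 2])          # index 0 appears twice
vals = np.array([1.0, 2.0, 3.0])

# Fancy-indexed assignment: a[idx] + vals is evaluated as a temporary
# ([1., 2., 3.]) and then written back. a[0] is written twice, so one
# write replaces the other; nothing is summed.
a = np.zeros(3)
a[idx] = a[idx] + vals

# np.add.at applies the addition once per index entry, so repeats accumulate.
b = np.zeros(3)
np.add.at(b, idx, vals)

print(a)  # a[0] ends up as 1.0 or 2.0, never 3.0
print(b)  # [3. 0. 3.]
```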

Approach #2 in @Divakar's answer gives me good results. For 23315 associations, the for loop took 50 ms, while Approach #1 took 1.85 ms. Approach #2 beat both at 668 µs.
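A comparison of this kind could be set up roughly as follows. This is only a sketch: the result shape (200 x 300), the random data and the function names are assumptions of mine, and the absolute timings will differ from the figures quoted above depending on machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 300, 23315          # shape is hypothetical; k is from the question
values = rng.random(k)
y_mapping = rng.integers(0, m, size=k)
x_mapping = rng.integers(0, n, size=k)

def with_loop():
    out = np.zeros((m, n))
    for i in range(k):
        out[-y_mapping[i], x_mapping[i]] += values[i]
    return out

def with_add_at():
    out = np.zeros((m, n))
    np.add.at(out, (-y_mapping, x_mapping), values)
    return out

def with_bincount():
    lidx = ((-y_mapping) % m) * n + x_mapping
    return np.bincount(lidx, weights=values, minlength=m * n).reshape(m, n)

for f in (with_loop, with_add_at, with_bincount):
    # Best of three single runs, reported in milliseconds
    seconds = min(timeit.repeat(f, number=1, repeat=3))
    print(f"{f.__name__}: {seconds * 1e3:.3f} ms")
```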

I'm using NumPy version 1.14.3 with Python 3.5.2 on an i7 processor.

Recommended answer

Approach #1

The most intuitive option is np.add.at, which accumulates correctly for those repeated indices:

np.add.at(result, (-y_mapping, x_mapping), values)
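As a self-contained check, the call can be run on the example data from the question. This sketch makes one assumption not in the original: result is created with the same float dtype as values, so that no casting between float and integer is involved.

```python
import numpy as np

values = np.arange(1.0, 9.0)                      # [1., 2., ..., 8.]
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])

# Assumption: a float result, matching the dtype of values
result = np.zeros((4, 2), dtype=values.dtype)

# Unbuffered in-place add: repeated (row, column) pairs accumulate
np.add.at(result, (-y_mapping, x_mapping), values)

print(result.tolist())
# [[6.0, 0.0], [6.0, 2.0], [14.0, 0.0], [8.0, 0.0]]
```

The index is passed as a tuple of arrays, one per axis, which is the form NumPy uses for multi-dimensional fancy indexing.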

Approach #2

Because the x, y indices may repeat, we need to perform binned summations. Hence, another way is to use NumPy's binned summation function, np.bincount, with an implementation like so:

# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping

# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)

# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)

If you are always starting off with a zeros array for result, the last step could be made more performant with:

result = binned_sums.astype(out_dtype).reshape(m,n)
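Putting the steps of Approach #2 together, a minimal sketch for the zeros-array case might look like this. The helper name binned_sum is hypothetical, and the example data are the mappings from the question. Calling np.bincount without weights gives the association counts mentioned in the question.

```python
import numpy as np

def binned_sum(shape, x_mapping, y_mapping, values):
    """Sum values into a zero array of the given 2D shape (hypothetical helper)."""
    m, n = shape
    # Flatten (row, column) to a linear index; % m turns the negative rows
    # produced by -y_mapping into their non-negative equivalents.
    lidx = ((-y_mapping) % m) * n + x_mapping
    return np.bincount(lidx, weights=values, minlength=m * n).reshape(m, n)

values = np.arange(1.0, 9.0)
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])

sums = binned_sum((4, 2), x_mapping, y_mapping, values)

# Without weights, bincount counts how many values map to each position
lidx = ((-y_mapping) % 4) * 2 + x_mapping
counts = np.bincount(lidx, minlength=4 * 2).reshape(4, 2)

print(sums.tolist())    # [[6.0, 0.0], [6.0, 2.0], [14.0, 0.0], [8.0, 0.0]]
print(counts.tolist())  # [[2, 0], [1, 1], [3, 0], [1, 0]]
```

Note that np.bincount with weights returns a float array, which is why the original code casts back to the dtype of result.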
