Numpy summarize one array by values of another

Question

I am trying to find a vectorized way to accomplish the following:

Say I have an array of x values and an array of y values. Note that the x values are not always ints and can be negative:

import numpy as np
x = np.array([-1,-1,-1,3,2,2,2,5,4,4], dtype=float)
y = np.array([0,1,0,1,0,1,0,1,0,1])

I want to group the y array by the sorted, unique values of the x array and count the occurrences of each y class within each group. For the example above, the result would look like this:

array([[ 2.,  1.],
       [ 2.,  1.],
       [ 0.,  1.],
       [ 1.,  1.],
       [ 0.,  1.]])

Here the first column is the count of '0' values for each unique value of x, and the second column is the count of '1' values for each unique value of x.

My current implementation is as follows:

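order = x.argsort()  # compute the sort order once and reuse it
x_sorted, y_sorted = x[order], y[order]

def collapse(x_sorted, y_sorted):
    # Index of the first occurrence of each unique x in the sorted array
    uniq_ids = np.unique(x_sorted, return_index=True)[1]
    y_collapsed = np.zeros((len(uniq_ids), 2))
    x_collapsed = x_sorted[uniq_ids]
    # Split y into one chunk per unique x and count each class
    for idx, y_group in enumerate(np.split(y_sorted, uniq_ids[1:])):
        y_collapsed[idx, 0] = (y_group == 0).sum()
        y_collapsed[idx, 1] = (y_group == 1).sum()
    return (x_collapsed, y_collapsed)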

collapse(x_sorted, y_sorted)
(array([-1.,  2.,  3.,  4.,  5.]),
 array([[ 2.,  1.],
       [ 2.,  1.],
       [ 0.,  1.],
       [ 1.,  1.],
       [ 0.,  1.]]))

This doesn't seem very much in the spirit of numpy, however, and I'm hoping some vectorized method exists for this kind of operation. I am trying to do this without resorting to pandas, although I know that library has a very convenient groupby operation.

Recommended answer

Since x is float, I would do this:

In [136]:

np.array([(x[y==0]==np.unique(x)[..., np.newaxis]).sum(axis=1),
          (x[y==1]==np.unique(x)[..., np.newaxis]).sum(axis=1)]).T
Out[136]:
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])
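
As a rough, self-contained sketch of the one-liner above, the same computation can be split into named steps. The names `ux`, `zeros`, `ones` and `counts` are only illustrative, and `np.column_stack` is used here in place of `np.array([...]).T`:

```python
import numpy as np

x = np.array([-1, -1, -1, 3, 2, 2, 2, 5, 4, 4], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Column vector of the sorted unique x values, shape (5, 1).
ux = np.unique(x)[:, np.newaxis]

# x[y == 0] has shape (m,); comparing it with ux broadcasts to (5, m).
# Summing each row counts how often that unique x occurs with y == 0.
zeros = (x[y == 0] == ux).sum(axis=1)
ones = (x[y == 1] == ux).sum(axis=1)

# One row per unique x, one column per y class.
counts = np.column_stack([zeros, ones])
print(counts)
```

Note that the comparison builds a temporary boolean array with one row per unique x and one column per selected element, so memory use grows with the product of the two.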

Speed:

In [152]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
np.array([(x[y==0]==ux).sum(axis=1),
          (x[y==1]==ux).sum(axis=1)]).T
10000 loops, best of 3: 92.7 µs per loop

Solution by @seikichi:

In [151]:

%%timeit
x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
r = np.r_[np.unique(x), np.inf]
np.concatenate([[np.histogram(x[y == v], r)[0]] for v in sorted(set(y))]).T
1000 loops, best of 3: 388 µs per loop

For more general cases when y is not just {0,1}, as @askewchan pointed out:

In [155]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
uy=np.unique(y)
np.asanyarray([(x[y==v]==ux).sum(axis=1) for v in uy]).T
10000 loops, best of 3: 116 µs per loop
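
For larger inputs, one possible alternative (not part of the original answer) avoids both the Python-level loop over `uy` and the intermediate boolean matrix. The sketch below assumes x and y are 1-D arrays of equal length. It uses `np.unique(..., return_inverse=True)` to turn both arrays into integer codes and `np.bincount` to count each (x, y) pair. The names `xi`, `yi` and `flat` are only illustrative:

```python
import numpy as np

x = np.array([-1, -1, -1, 3, 2, 2, 2, 5, 4, 4], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Map each element to the index of its value among the sorted unique values.
ux, xi = np.unique(x, return_inverse=True)
uy, yi = np.unique(y, return_inverse=True)

# Encode each (x index, y index) pair as one integer, count the codes,
# then reshape into a (number of unique x, number of unique y) table.
flat = xi * len(uy) + yi
counts = np.bincount(flat, minlength=len(ux) * len(uy)).reshape(len(ux), len(uy))

print(ux)
print(counts)
```

Row `i` of `counts` corresponds to `ux[i]` and column `j` to `uy[j]`, so the table layout matches the broadcasting version.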

To explain the broadcasting further, see this example:

In [5]:

np.unique(a)
Out[5]:
array([ 0. ,  0.2,  0.4,  0.5,  0.6,  1.1,  1.5,  1.6,  1.7,  2. ])
In [8]:

np.unique(a)[...,np.newaxis] #what [..., np.newaxis] will do:
Out[8]:
array([[ 0. ],
       [ 0.2],
       [ 0.4],
       [ 0.5],
       [ 0.6],
       [ 1.1],
       [ 1.5],
       [ 1.6],
       [ 1.7],
       [ 2. ]])
In [10]:

(a==np.unique(a)[...,np.newaxis]).astype('int') #then we can broadcast (converted to int for readability)
Out[10]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]])
In [11]:

(a==np.unique(a)[...,np.newaxis]).sum(axis=1) #getting the count of each unique value becomes summing along the 2nd axis
Out[11]:
array([1, 3, 1, 1, 2, 1, 1, 1, 1, 3])
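
The array `a` in the session above is never defined, so here is a small runnable sketch of the same steps. The sample data is made up for illustration and is not the `a` from the transcript:

```python
import numpy as np

# Hypothetical sample data, chosen only to keep the shapes small.
a = np.array([0.5, 0.2, 0.5, 1.1, 0.2, 0.5])

ua = np.unique(a)            # shape (3,): the sorted unique values
col = ua[..., np.newaxis]    # shape (3, 1): one unique value per row

# Comparing shape (6,) with shape (3, 1) broadcasts to shape (3, 6):
# mask[i, j] is True when a[j] equals the i-th unique value.
mask = (a == col)

# Summing along axis 1 counts the matches for each unique value.
counts = mask.sum(axis=1)

print(mask.astype(int))
print(counts)  # [2 3 1]
```

For plain counts of unique values, `np.unique(a, return_counts=True)` returns the same numbers directly. The broadcasting form is what allows the comparison to be restricted to a subset such as `x[y == 0]`.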
