Number of unique elements per row in a NumPy array


Question

For example, for

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I want to get

[2, 2, 3]

Is there a way to do this without for loops or np.vectorize?

The actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that contain duplicates. This was a homework problem that I have already solved (with a for loop), but I was wondering whether there is a better way to do it with NumPy.

Recommended answer

Approach #1

One vectorized approach sorts each row and then counts the positions where neighbouring values differ -

In [8]: b = np.sort(a,axis=1)

In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
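
After sorting, equal values sit next to each other, so every position where a value differs from its left neighbour starts a new distinct value. The following is a minimal sketch of how this could be applied to the goal stated in the question (the percentage of rows with duplicates). The helper name `nunique_per_row`, the random stand-in data, and the seed are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def nunique_per_row(a):
    # Sort each row so that equal values become adjacent.
    b = np.sort(a, axis=1)
    # Each change between neighbours marks a new distinct value; +1 for the first one.
    return (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
counts = nunique_per_row(a)            # array([2, 2, 3])

# A row has duplicates when it has fewer distinct values than columns.
has_dup = counts < a.shape[1]
pct_dup = 100.0 * has_dup.mean()       # 2 of 3 rows have duplicates

# Stand-in for the data described in the question:
# 1000 rows of 100 values, each in 1..365 (seed chosen arbitrarily).
rng = np.random.default_rng(0)
data = rng.integers(1, 366, size=(1000, 100))
pct_dup_data = 100.0 * (nunique_per_row(data) < data.shape[1]).mean()
```

Comparing the count with `a.shape[1]` avoids a second pass over the data: a row is duplicate-free exactly when its number of distinct values equals its length.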

Approach #2

Another method, for non-negative ints that aren't very large, is to shift each row by its own offset so that values from different rows land in disjoint ranges, then do a single binned count with np.bincount and count the number of non-zero bins per row -

n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
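
To make the offsetting step concrete, here is a sketch that traces the intermediate arrays for the small example from the question. The variable name `counts` is introduced only for illustration:

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                                  # 5 possible values: 0..4
a_off = a + np.arange(a.shape[0])[:, None] * n   # row i is shifted by i*n
# a_off -> [[ 1,  0,  0],
#           [ 6,  5,  5],
#           [12, 13, 14]]

M = a.shape[0] * n                               # total number of bins (15)
counts = np.bincount(a_off.ravel(), minlength=M).reshape(-1, n)
# counts -> [[2, 1, 0, 0, 0],
#            [2, 1, 0, 0, 0],
#            [0, 0, 1, 1, 1]]

out = (counts != 0).sum(axis=1)                  # array([2, 2, 3])
```

The intermediate table has one bin per (row, possible value) pair, so its size grows with the number of rows times `a.max() + 1`. That is why this approach suits ints that aren't very large.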


Runtime test

Approaches as funcs -

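import numpy as np
import pandas as pd
# compose is not defined in the original post; toolz.compose is assumed here
from toolz import compose

def sorting(a):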
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post   
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a) 

Case #1 : Square shaped one

In [164]: np.random.seed(0)

In [165]: a = np.random.randint(0,5,(10000,10000))

In [166]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop

Case #2 : Large number of rows

In [167]: np.random.seed(0)

In [168]: a = np.random.randint(0,5,(1000000,10))

In [169]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop


Extending to number of unique elements per column

To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -

def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)


Generic ndarray with generic axis

Let's see how we can extend this to an ndarray with any number of dimensions and get the unique counts along any axis. We use np.diff with its axis param to get the consecutive differences, which makes the function generic, like so -

def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1

Sample run -

In [77]: a
Out[77]: 
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])

In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])

In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
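
As a sanity check on the generic version, the sketch below compares it against a slow reference on a 3-D array. The reference function `nunique_reference`, the array shape, and the seed are assumptions made for illustration:

```python
import numpy as np

def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis) + 1

def nunique_reference(a, axis):
    # Slow but straightforward: call np.unique on every 1-D slice along `axis`.
    return np.apply_along_axis(lambda v: len(np.unique(v)), axis, a)

rng = np.random.default_rng(0)               # seed chosen arbitrarily
a3 = rng.integers(0, 4, size=(3, 4, 5))

# The two should agree along every axis of the 3-D array.
for axis in range(a3.ndim):
    assert np.array_equal(nunique(a3, axis), nunique_reference(a3, axis))

# The reduced axis is dropped from the result.
result_shape = nunique(a3, axis=-1).shape    # (3, 4)
```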

If you are working with floating-point numbers and want uniqueness to be decided by a tolerance rather than an exact match, you can use np.isclose. Two such options would be -

(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)

For a custom tolerance value, pass atol to np.isclose. Because the differences are compared against 0, the relative tolerance rtol has no effect here.
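
A short sketch of the tolerance-based variant, wrapped in a hypothetical helper `nunique_close` (the name and its default are assumptions made for illustration):

```python
import numpy as np

def nunique_close(a, axis, atol=1e-8):
    # Sorted neighbours whose gap is within atol are treated as the same value.
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    return (~np.isclose(d, 0, atol=atol)).sum(axis=axis) + 1

# 0.1 + 0.2 is not exactly equal to 0.3 in floating point.
a = np.array([[0.1 + 0.2, 0.3, 1.0],
              [1.0,       2.0, 3.0]])

exact = (np.diff(np.sort(a, axis=1), axis=1) != 0).sum(axis=1) + 1   # array([3, 3])
close = nunique_close(a, axis=1)                                     # array([2, 3])
loose = nunique_close(a, axis=1, atol=1.5)                           # array([1, 1])
```

Only neighbouring sorted values are compared, so a chain of values that are each within the tolerance of the next collapses into a single group. In the last call, 1.0, 2.0 and 3.0 count as one value even though 1.0 and 3.0 differ by more than atol.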
