Return array of counts for each feature of input


Question

I have an array of integer labels and I would like to determine how many times each label occurs and store those counts in an array of the same shape as the input. This can be accomplished with the following loop:

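import numpy

def counter(labels):
    # Result has the same shape as the input labels.
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        mask = labels == num  # boolean mask of positions holding this label
        sizes[mask] = numpy.count_nonzero(mask)
    return sizes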

Input:

array = numpy.array([
       [0, 1, 2, 3],
       [0, 1, 1, 3],
       [3, 1, 3, 1]])

counter(array) returns:

array([[ 2.,  5.,  1.,  4.],
       [ 2.,  5.,  5.,  4.],
       [ 4.,  5.,  4.,  5.]])

However, for large arrays with many unique labels (60,000 in my case), this takes a considerable amount of time. This is the first step in a complex algorithm and I can't afford to spend more than about 30 seconds on this step. Is there an existing function that can accomplish this? If not, how can I speed up the existing loop?

Recommended answer

Approach #1

Here's one using np.unique -

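# tags: index of each element's label in the sorted unique array; count: occurrences per unique label
_, tags, count = np.unique(labels, return_counts=True, return_inverse=True)
sizes = count[tags].reshape(labels.shape)  # reshape in case tags comes back flattened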
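
As a minimal sketch, here is the np.unique approach applied to the 2-D example from the question. The function name `counts_via_unique` is hypothetical, and the explicit reshape is an assumption added here because, depending on the NumPy version, the inverse indices may come back flattened for multi-dimensional input.

```python
import numpy as np

def counts_via_unique(labels):
    # Hypothetical helper name, not part of the original answer.
    labels = np.asarray(labels)
    # inverse maps every element to the index of its value in the sorted unique array;
    # counts holds the number of occurrences of each unique value.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # Reshape so the result matches the input shape however inverse is returned.
    return counts[inverse].reshape(labels.shape)

array = np.array([[0, 1, 2, 3],
                  [0, 1, 1, 3],
                  [3, 1, 3, 1]])
sizes = counts_via_unique(array)
print(sizes)
```

Unlike the loop in the question, this yields an integer array rather than a float one; cast with `.astype(float)` if the float dtype matters downstream.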

Approach #2

With non-negative integers in labels, a simpler and more efficient way is np.bincount -

sizes = np.bincount(labels.ravel())[labels]  # np.bincount needs 1-D input
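
The following sketch wraps the np.bincount idea in a function that also accepts multi-dimensional input, since np.bincount itself only takes a 1-D array of non-negative integers. The name `counts_via_bincount` and the offset step for negative labels are assumptions added for illustration, not part of the original answer; the input is assumed to be a non-empty integer array.

```python
import numpy as np

def counts_via_bincount(labels):
    # Hypothetical helper name; assumes a non-empty integer array.
    labels = np.asarray(labels)
    flat = labels.ravel()            # np.bincount only accepts 1-D input
    offset = flat.min()              # shift so the smallest label maps to bin 0
    counts = np.bincount(flat - offset)
    # Index with the shifted labels to get a result shaped like the input.
    return counts[labels - offset]

array = np.array([[0, 1, 2, 3],
                  [0, 1, 1, 3],
                  [3, 1, 3, 1]])
sizes = counts_via_bincount(array)
negative_sizes = counts_via_bincount(np.array([-1, -1, 2]))
print(sizes)
print(negative_sizes)
```

Note that np.bincount allocates one bin per integer between the smallest and largest label, so it suits dense label ranges; for sparse labels with very large values the np.unique approach is likely the safer choice.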


Runtime test

Two setups are timed, each drawing labels from 60,000 possible non-negative values, with lengths 100,000 and 1,000,000.

Set #1:

In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447

Set #2:

In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
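
To repeat this kind of comparison outside IPython, a rough sketch with the standard `timeit` module could look like the following. The function names and the reduced problem size (5,000 possible labels, 20,000 elements) are assumptions chosen so the loop finishes quickly; absolute timings will differ by machine and NumPy version, so none are shown.

```python
import timeit
import numpy as np

def loop_counter(labels):
    # The loop from the question.
    sizes = np.zeros(labels.shape)
    for num in np.unique(labels):
        mask = labels == num
        sizes[mask] = np.count_nonzero(mask)
    return sizes

def bincount_counter(labels):
    # Approach #2; labels must be 1-D and non-negative here.
    return np.bincount(labels)[labels]

rng = np.random.RandomState(0)
labels = rng.randint(0, 5000, 20000)  # reduced sizes (assumed) for a quick run

# Confirm both approaches agree before comparing their speed.
same = np.array_equal(loop_counter(labels), bincount_counter(labels))
t_loop = timeit.timeit(lambda: loop_counter(labels), number=1)
t_bin = timeit.timeit(lambda: bincount_counter(labels), number=10) / 10
print("results match:", same)
print("loop: %.4f s, bincount: %.6f s" % (t_loop, t_bin))
```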
