Return array of counts for each feature of input


Question

I have an array of integer labels. I would like to determine how many times each label occurs and store those counts in an array of the same shape as the input, so that each position holds the count of its own label. This can be accomplished with the following loop:

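import numpy

def counter(labels):
    # Output has the same shape as the input (float64, the numpy.zeros default)
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        mask = labels == num  # positions holding this label
        sizes[mask] = numpy.count_nonzero(mask)  # write this label's count there
    return sizes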

With the input:

array = numpy.array([
       [0, 1, 2, 3],
       [0, 1, 1, 3],
       [3, 1, 3, 1]])

counter(array) returns:

array([[ 2.,  5.,  1.,  4.],
       [ 2.,  5.,  5.,  4.],
       [ 4.,  5.,  4.,  5.]])

However, for large arrays with many unique labels (60,000 in my case), this takes a considerable amount of time. This is the first step in a complex algorithm, and I can't afford to spend more than about 30 seconds on it. Is there an existing function that can accomplish this? If not, how can I speed up the loop?

Recommended answer

Approach #1

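Here's one using np.unique. With return_inverse, it gives for each element the index of its label among the sorted unique labels, and with return_counts it gives the number of occurrences of each unique label, so indexing the counts with the inverse indices yields the per-element count: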

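_, tags, count = np.unique(labels, return_inverse=True, return_counts=True)
# reshape in case the inverse indices come back flattened for multi-dimensional input
sizes = count[tags].reshape(labels.shape)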
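
As a minimal sketch, the np.unique approach can be applied to the 2-D example from the question. The function name `counts_via_unique` is hypothetical, and the final `reshape` is an assumption added here to cover NumPy versions where the inverse indices come back flattened for multi-dimensional input.

```python
import numpy as np

def counts_via_unique(labels):
    """Return an array shaped like `labels` holding each element's label count."""
    labels = np.asarray(labels)
    # inverse: index of each element's label within the sorted unique labels
    # counts:  number of occurrences of each unique label
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # reshape is a no-op if the inverse already has the input's shape
    return counts[inverse].reshape(labels.shape)

example = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])

sizes_unique = counts_via_unique(example)
print(sizes_unique)
# [[2 5 1 4]
#  [2 5 5 4]
#  [4 5 4 5]]
```

Unlike the loop in the question, which fills a float array, this returns integer counts.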

Approach #2

With non-negative integer labels, a simpler and more efficient way uses np.bincount:

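# np.bincount needs 1-D input, so flatten for counting; indexing with labels restores the input shape
sizes = np.bincount(labels.ravel())[labels]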
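
The np.bincount approach can be sketched as a small function, again on the question's 2-D example. The name `counts_via_bincount` and the explicit check for negative labels are assumptions added for illustration. Note that np.bincount allocates one slot per integer from 0 to the largest label, so this fits best when the largest label is not far above the number of distinct labels.

```python
import numpy as np

def counts_via_bincount(labels):
    """Per-element label counts for non-negative integer labels."""
    labels = np.asarray(labels)
    if labels.size and labels.min() < 0:
        # Illustrative guard (an assumption, not part of the original answer)
        raise ValueError("np.bincount only supports non-negative integer labels")
    # np.bincount requires 1-D input; per_label[k] is the number of times k occurs
    per_label = np.bincount(labels.ravel())
    # Indexing with the original array returns the counts in the input's shape
    return per_label[labels]

example = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])

sizes_bincount = counts_via_bincount(example)
print(sizes_bincount)
# [[2 5 1 4]
#  [2 5 5 4]
#  [4 5 4 5]]
```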


Runtime test

Labels are drawn from 60,000 possible non-negative values, and two such sets, of lengths 100,000 and 1,000,000, are timed.

Set #1:

In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447

Set #2:

In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
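
The timings above compare speed only. As a quick sanity sketch, the following checks that both vectorized approaches reproduce the loop's output on random 1-D data. The helper name `counter_loop` and the reduced sizes (1,000 possible labels, 20,000 elements) are assumptions chosen so the loop finishes quickly; they are not the benchmark settings.

```python
import numpy as np

def counter_loop(labels):
    """Reference implementation: the loop from the question."""
    sizes = np.zeros(labels.shape)
    for num in np.unique(labels):
        mask = labels == num
        sizes[mask] = np.count_nonzero(mask)
    return sizes

# Smaller than the benchmark sets so the reference loop runs quickly
rng = np.random.RandomState(0)
labels = rng.randint(0, 1000, 20000)

expected = counter_loop(labels)

# Approach #1: np.unique with inverse indices and counts
_, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
via_unique = counts[inverse].reshape(labels.shape)

# Approach #2: np.bincount (labels are 1-D and non-negative here)
via_bincount = np.bincount(labels)[labels]

# array_equal compares values, so the loop's float output matches the integer results
agree = np.array_equal(expected, via_unique) and np.array_equal(expected, via_bincount)
print(agree)  # True
```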
