Return array of counts for each feature of input
Question
I have an array of integer labels. I would like to determine how many times each label occurs and store those counts in an array of the same shape as the input. This can be accomplished with the following loop:
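import numpy

def counter(labels):
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        # Boolean mask of every position holding this label
        mask = labels == num
        # Write the label's total count into each of those positions
        sizes[mask] = numpy.count_nonzero(mask)
    return sizes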
With the input:
array = numpy.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
counter(array)
it returns:
array([[ 2.,  5.,  1.,  4.],
       [ 2.,  5.,  5.,  4.],
       [ 4.,  5.,  4.,  5.]])
However, for large arrays with many unique labels (60,000 in my case), this takes a considerable amount of time. This is the first step in a complex algorithm, and I can't afford to spend more than about 30 seconds on it. Is there an existing function that can accomplish this? If not, how can I speed up the existing loop?
Answer
Approach #1
Here's one using np.unique:
import numpy as np

_, tags, count = np.unique(labels, return_counts=True, return_inverse=True)
sizes = count[tags].reshape(labels.shape)  # reshape keeps the input's shape for N-D labels
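As a worked example, the np.unique approach can be applied to the 3x4 array from the question. This is a minimal sketch: the function name counts_via_unique is made up for illustration, and the inverse indices are flattened explicitly because their shape for multi-dimensional input differs between NumPy versions.

```python
import numpy as np

def counts_via_unique(labels):
    # Hypothetical helper name; wraps the np.unique approach for N-D input.
    labels = np.asarray(labels)
    # inverse maps each element to the index of its value in the sorted uniques;
    # counts holds how often each unique value occurs.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # Flatten the inverse first so the result is the same on old and new NumPy,
    # then restore the input's shape.
    return counts[inverse.ravel()].reshape(labels.shape)

array = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
sizes = counts_via_unique(array)
print(sizes)
```

The counts match the loop's output from the question, except that the result has an integer dtype rather than float.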
Approach #2
If labels contains only non-negative integers, np.bincount offers a simpler and more efficient way:
sizes = np.bincount(labels.ravel())[labels]  # bincount needs 1-D input, so flatten first
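np.bincount accepts only a 1-D array of non-negative integers, so multi-dimensional or negative labels need a little preparation. The sketch below is one possible way to handle both; the function name counts_via_bincount and the shift-by-minimum step are assumptions added for illustration, not part of the original answer, and the input is assumed to be a non-empty integer array.

```python
import numpy as np

def counts_via_bincount(labels):
    # Hypothetical helper name; assumes a non-empty array of integers.
    labels = np.asarray(labels)
    flat = labels.ravel()
    # Assumption: shift labels so the smallest maps to 0, which lets
    # np.bincount cope with negative labels as well.
    shifted = flat - flat.min()
    # bincount(shifted)[k] is the number of occurrences of shifted label k;
    # indexing with shifted broadcasts each count back to every position.
    return np.bincount(shifted)[shifted].reshape(labels.shape)

array = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
sizes_2d = counts_via_bincount(array)
sizes_neg = counts_via_bincount(np.array([-2, -2, 5]))
print(sizes_2d)
print(sizes_neg)
```

One design consideration: np.bincount allocates one bin per integer between 0 and the largest (shifted) label, so for sparse labels with very large values the np.unique approach may use far less memory.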
Runtime test
The setup draws labels from 60,000 possible non-negative values. Two such sets, of lengths 100,000 and 1,000,000, are timed.
Set #1:
In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447
Set #2:
In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
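The timings say nothing about correctness, so it may be worth confirming that the fast versions produce the same counts as the original loop. The following is a rough check on a small random 1-D input; the function name counter_loop, the array size, and the label range are arbitrary choices made here for illustration.

```python
import numpy as np

def counter_loop(labels):
    # The loop from the question, used as the reference result.
    sizes = np.zeros(labels.shape)
    for num in np.unique(labels):
        mask = labels == num
        sizes[mask] = np.count_nonzero(mask)
    return sizes

# Small 1-D test input (sizes chosen arbitrarily so the loop stays fast).
rng = np.random.default_rng(0)
labels = rng.integers(0, 500, size=2000)

via_loop = counter_loop(labels)

# Approach #1: np.unique with inverse indices and counts.
_, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
via_unique = counts[inverse]

# Approach #2: np.bincount (labels are non-negative and 1-D here).
via_bincount = np.bincount(labels)[labels]

all_agree = (np.array_equal(via_loop, via_unique)
             and np.array_equal(via_unique, via_bincount))
print(all_agree)
```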