Return array of counts for each feature of input
Question
I have an array of integer labels. I would like to determine how many times each label occurs and store those counts in an array of the same shape as the input, so that every position holds the count of the label found there. This can be accomplished with the following loop:
import numpy

def counter(labels):
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        # Boolean mask of every position holding this label
        mask = labels == num
        sizes[mask] = numpy.count_nonzero(mask)
    return sizes
Input:
array = numpy.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
counter(array)
Returns:
array([[ 2., 5., 1., 4.],
       [ 2., 5., 5., 4.],
       [ 4., 5., 4., 5.]])
However, for large arrays with many unique labels (60,000 in my case), this takes a considerable amount of time. This is the first step in a complex algorithm, and I can't afford to spend more than about 30 seconds on it. Is there an existing function that can accomplish this? If not, how can I speed up the existing loop?
Recommended answer
Approach #1
Here's one using np.unique:
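# tags maps each element to the index of its unique value; count holds the per-value totals
_, tags, count = np.unique(labels, return_counts=True, return_inverse=True)
# Reshape so that multi-dimensional input keeps its shape
sizes = count[tags].reshape(labels.shape)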
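As a minimal sketch, here is the np.unique approach applied to the 2-D example from the question. The function name counts_via_unique is hypothetical and not part of the original answer. The inverse indices are flattened and the result reshaped explicitly, on the assumption that the shape of the inverse returned for multi-dimensional input can differ between NumPy versions.

```python
import numpy as np

def counts_via_unique(labels):
    """Return an array shaped like `labels` holding each label's total count."""
    labels = np.asarray(labels)
    # inverse[i] is the index into the unique values for the i-th element;
    # counts[j] is how often the j-th unique value occurs.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # Flatten the inverse first so the lookup behaves the same whatever
    # shape the installed NumPy version returns for it.
    return counts[inverse.ravel()].reshape(labels.shape)

labels = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
sizes = counts_via_unique(labels)
print(sizes)
```

Unlike the loop in the question, this returns integer counts rather than floats, and it also works for negative labels because np.unique does not require non-negative input.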
Approach #2
When labels contains only non-negative integers, np.bincount gives a simpler and more efficient way:
# np.bincount only accepts 1-D input, so flatten before counting
sizes = np.bincount(labels.ravel())[labels]
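The np.bincount approach can be sketched on the question's 2-D example as follows. The function name counts_via_bincount and the explicit negative-label check are assumptions added for illustration. Note that np.bincount allocates one slot per integer from 0 up to the largest label, so this suits labels that are reasonably dense.

```python
import numpy as np

def counts_via_bincount(labels):
    """Return an array shaped like `labels` holding each label's total count.

    Assumes `labels` has an integer dtype with no negative values.
    """
    labels = np.asarray(labels)
    if labels.size and labels.min() < 0:
        raise ValueError("np.bincount requires non-negative labels")
    # counts[k] is the number of occurrences of label k
    counts = np.bincount(labels.ravel())
    # Indexing with the original array broadcasts the counts back to its shape
    return counts[labels]

labels = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])
sizes = counts_via_bincount(labels)
print(sizes)
```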
Runtime test
Two setups are timed, each drawing non-negative labels from 60,000 possible values: one with 100,000 elements and one with 1,000,000 elements.
Setup #1:
In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447
Setup #2:
In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
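To rerun a comparison like this outside IPython, something along the following lines could be used. It is only a sketch: the function names, the smaller problem size (10,000 elements drawn from 6,000 possible labels), and the use of the standard timeit module are assumptions chosen so the script finishes quickly, and the timings it prints will differ from the figures above depending on hardware and NumPy version.

```python
import timeit
import numpy as np

def counter_loop(labels):
    # The loop from the question: one full pass over the array per unique label
    sizes = np.zeros(labels.shape)
    for num in np.unique(labels):
        mask = labels == num
        sizes[mask] = np.count_nonzero(mask)
    return sizes

def counter_bincount(labels):
    # Count every label in a single pass, then look the counts up per element
    return np.bincount(labels.ravel())[labels]

# Smaller than the setups above so the loop version finishes quickly
rng = np.random.RandomState(0)
labels = rng.randint(0, 6000, 10000)

# Both versions should produce the same counts (float vs. int dtype aside)
same = np.array_equal(counter_loop(labels), counter_bincount(labels))
print("results agree:", same)

t_loop = timeit.timeit(lambda: counter_loop(labels), number=1)
t_bin = timeit.timeit(lambda: counter_bincount(labels), number=10) / 10
print("loop:     %.6f s per call" % t_loop)
print("bincount: %.6f s per call" % t_bin)
```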