Grouping indices of unique elements in numpy
Question
I have many large (>100,000,000) lists of integers that contain many duplicates. I want to get the indices where each element occurs. Currently I am doing something like this:
import numpy as np
from collections import defaultdict
a = np.array([1, 2, 6, 4, 2, 3, 2])
d = defaultdict(list)
for i, e in enumerate(a):
    d[e].append(i)
d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})
This method of iterating through each element is time consuming. Is there an efficient or vectorized way to do this?
Edit 1: I tried the methods of Acorbe and Jaime on the following:
a = np.random.randint(2000, size=10000000)
The results are:
original: 5.01767015457 secs
Acorbe: 6.11163902283 secs
Jaime: 3.79637312889 secs
Accepted answer
This is very similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality:
>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])
Now:
>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1, 1], dtype=int64)
To get the positional indices for each value, we simply do:
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
array([3], dtype=int64), array([2], dtype=int64)]
And you can now construct your dictionary by zipping unq_items and unq_idx.
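Putting the steps above together, the zipped dictionary can be built like this (a minimal sketch of the same approach; note that with NumPy's default quicksort the order of indices within each group is not guaranteed, although the grouping itself is):

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])

# The sorting-based steps from the answer above
sort_idx = np.argsort(a)
a_sorted = a[sort_idx]
# True at the first element of each run of equal values
unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
unq_items = a_sorted[unq_first]
unq_count = np.diff(np.nonzero(unq_first)[0])
unq_idx = np.split(sort_idx, np.cumsum(unq_count))

# Zip the unique items with their index groups into a dict
d = {item: idx for item, idx in zip(unq_items.tolist(), unq_idx)}
```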
Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values you could do:
>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
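Wrapped up as a single self-contained helper, the whole method might look like this (a sketch rather than the answer's own code; the name group_indices is made up, and kind='mergesort' is used because a stable sort keeps each group's indices in ascending order):

```python
import numpy as np

def group_indices(a):
    """Return a dict mapping each unique value of a to the array of its indices."""
    # A stable sort keeps equal elements in their original order,
    # so each group's indices come out ascending.
    sort_idx = np.argsort(a, kind='mergesort')
    a_sorted = a[sort_idx]
    # True at the start of each run of equal values
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # Run lengths, including the last run
    unq_count = np.diff(np.concatenate((np.nonzero(unq_first)[0], [a.size])))
    unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
    return dict(zip(unq_items.tolist(), unq_idx))

d = group_indices(np.array([1, 2, 6, 4, 2, 3, 2]))
```

Since NumPy 1.9 the counts themselves are also available directly via np.unique(a, return_counts=True), but the manual computation above additionally yields sort_idx, which is what lets us split out the index groups.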