Grouping indices of unique elements in numpy

Question

I have many large (>100,000,000) lists of integers that contain many duplicates. I want to get the indices at which each element occurs. Currently I am doing something like this:

import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d=defaultdict(list)
for i,e in enumerate(a):
    d[e].append(i)

d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})

This method of iterating through each element is time-consuming. Is there an efficient or vectorized way to do this?

Edit 1: I tried the methods of Acorbe and Jaime on the following array:

a = np.random.randint(2000, size=10000000)

The results are:

original: 5.01767015457 secs
Acorbe: 6.11163902283 secs
Jaime: 3.79637312889 secs

Recommended answer

This is very similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])

Now:

>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1], dtype=int64)

To get the positional indices for each value, we simply do:

>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
 array([3], dtype=int64), array([2], dtype=int64)]

You can now construct your dictionary by zipping unq_items and unq_idx.
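
As a minimal sketch of that last step, the pieces can be wrapped into one function that returns the dictionary. The name group_indices is hypothetical, the input is assumed to be one-dimensional, and kind="stable" (available in NumPy 1.15 and later) is an addition not used in the answer's code; it keeps each group's indices in ascending order. The split points are taken directly from the run starts, which gives the same values as np.cumsum(unq_count).

```python
import numpy as np

def group_indices(a):
    """Map each unique value in a 1-D array to the positions where it occurs.

    Hypothetical helper, not part of NumPy.
    """
    a = np.asarray(a)
    if a.size == 0:
        return {}
    # Assumption: a stable sort, so equal values keep their original order
    # and each group's indices come out ascending.
    sort_idx = np.argsort(a, kind="stable")
    a_sorted = a[sort_idx]
    # True wherever a new run of equal values starts in the sorted array.
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # The start of every run except the first is a split point.
    split_points = np.nonzero(unq_first)[0][1:]
    unq_idx = np.split(sort_idx, split_points)
    # tolist() turns the keys into plain Python ints.
    return dict(zip(unq_items.tolist(), unq_idx))

groups = group_indices(np.array([1, 2, 6, 4, 2, 3, 2]))
```

Here groups[2] holds the positions of the value 2 as a NumPy array; call .tolist() on a value if plain Python lists are preferred.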

Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you want the counts for all values, you could do:

>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
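
On NumPy 1.9 or later, the counting functionality mentioned above is exposed through the return_counts argument of np.unique, so the full counts can probably be obtained without the manual diff. The following is a sketch under that assumption rather than the answer's own code. It sorts the data twice (once inside np.unique, once in argsort), so it is likely slower on very large arrays. kind="stable" is again an addition, requiring NumPy 1.15 or later.

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])

# Unique values together with the number of occurrences of each one.
unq_items, unq_count = np.unique(a, return_counts=True)

# Assumption: a stable sort, so indices within each group stay ascending.
sort_idx = np.argsort(a, kind="stable")

# The last cumulative count equals a.size, so it is dropped to avoid
# producing an empty trailing group.
unq_idx = np.split(sort_idx, np.cumsum(unq_count)[:-1])
```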