Checking for and indexing non-unique/duplicate values in a numpy array

Views: 1084 Published: 2016/6/3 10:29:18 Tags: python arrays numpy unique

Question

I have an array traced_descIDs containing object IDs, and I want to identify which items in this array are not unique. Then, for each distinct duplicated ID, I need to identify which indices of traced_descIDs are associated with it.

As an example, if we take the traced_descIDs below, I want the following results:

traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]

I'm currently finding out which objects have more than one entry with:

import numpy as np
# traced_descIDs must be a numpy array for the elementwise comparison to work
mentions = np.array([len(np.argwhere(traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)

However, this takes too long, as len(traced_descIDs) is around 150,000. Is there a faster way to achieve the same result?

Any help greatly appreciated. Cheers.

Recommended answer

While a dictionary-based approach is O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which rely on sorting and are O(n*log n). In your case, the starting point would be:

import numpy as np

a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)

If you are using a version of numpy earlier than 1.9, np.unique has no return_counts argument, so that last line would have to be:

unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)

The contents of the three arrays we have created are:

>>> unq
array([  1,  23,  90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])
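
To make the role of these arrays more concrete, here is a small sketch (the names rebuilt and counts_from_inverse are mine, not part of the original answer). It shows that unq_idx holds, for every element of a, its position in unq, and that counting those positions reproduces unq_cnt:

```python
import numpy as np

a = np.array([1, 345, 23, 345, 90, 1])
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)

# unq_idx maps every element of a to its position in unq,
# so indexing unq with it rebuilds the original array.
rebuilt = unq[unq_idx]

# Counting how often each position occurs gives the same numbers
# as return_counts (this is the pre-1.9 fallback shown earlier).
counts_from_inverse = np.bincount(unq_idx)

print(np.array_equal(rebuilt, a))
print(np.array_equal(counts_from_inverse, unq_cnt))
```

Both comparisons should print True for this input.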

To get the duplicated items:

cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]

>>> dup_ids
array([  1, 345])

Getting the indices is a little more involved, but pretty straightforward:

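# positions in unq of the values that occur more than once
cnt_idx, = np.nonzero(cnt_mask)
# mask over the original array: True where the element is a duplicated value
idx_mask = np.in1d(unq_idx, cnt_idx)
# original indices of those elements
idx_idx, = np.nonzero(idx_mask)
# ordering that places equal values next to each other
srt_idx = np.argsort(unq_idx[idx_mask])
# cut the grouped indices into one array per duplicated value
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])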

>>> dup_idx
[array([0, 5]), array([1, 3])]
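
The steps above can also be folded into one helper. The sketch below is a variation on the answer's approach, not the answer's own code: find_duplicates and find_duplicates_dict are hypothetical names, and the first function uses a single stable argsort so that the indices inside each group come out in their original order (the default argsort kind does not guarantee this for tied values). The second function is a plain-Python version of the O(n) dictionary approach mentioned at the start of the answer, included for comparison.

```python
from collections import defaultdict

import numpy as np


def find_duplicates(ids):
    """Return (dup_ids, dup_idx) for a 1-D sequence of IDs (hypothetical helper)."""
    ids = np.asarray(ids)
    # A stable sort keeps equal IDs in their original order of appearance.
    order = np.argsort(ids, kind="stable")
    sorted_ids = ids[order]
    # start[k] is where the run of the k-th unique value begins in sorted_ids.
    unq, start, counts = np.unique(sorted_ids, return_index=True, return_counts=True)
    # One group of original indices per unique value.
    groups = np.split(order, start[1:])
    keep = counts > 1
    dup_ids = unq[keep]
    dup_idx = [g for g, k in zip(groups, keep) if k]
    return dup_ids, dup_idx


def find_duplicates_dict(ids):
    """Dictionary-based version: maps each duplicated ID to its list of indices."""
    positions = defaultdict(list)
    for i, value in enumerate(ids):
        positions[value].append(i)
    return {value: idx for value, idx in positions.items() if len(idx) > 1}


dup_ids, dup_idx = find_duplicates([1, 345, 23, 345, 90, 1])
print(dup_ids)
print(dup_idx)
print(find_duplicates_dict([1, 345, 23, 345, 90, 1]))
```

Which of the two is faster for an array of around 150,000 IDs depends on the data and the numpy version, so it is worth timing both on the real input.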
