Determining duplicate values in an array




Suppose I have an array

a = np.array([1, 2, 1, 3, 3, 3, 0])

How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.

I've come up with a few methods that appear to work:

Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

Set operations

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]

This one is cute but probably illegal (as a isn't actually unique):

np.setxor1d(a, np.unique(a), assume_unique=True)

Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

Sorting

s = np.sort(a, axis=None)
s[1:][s[1:] == s[:-1]]
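Note that the sorting approach reports a value once per extra occurrence ([1, 3, 3] for this example); if only the distinct duplicated values are wanted, one small tweak collapses the repeats:

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
s = np.sort(a, axis=None)
# adjacent equal elements after sorting mark duplicates;
# np.unique collapses repeated reports into one entry per value
dups = np.unique(s[1:][s[1:] == s[:-1]])  # array([1, 3])
```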

Pandas

s = pd.Series(a)
s[s.duplicated()]

Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
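One more variant worth noting, assuming a NumPy recent enough to have it (1.9+): np.unique gained a return_counts flag, which turns the histogram idea into a direct two-liner without going through bincount:

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
# count occurrences of each distinct value, then keep those seen more than once
u, counts = np.unique(a, return_counts=True)
dups = u[counts > 1]  # array([1, 3])
```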


Conclusions

Testing with a 10 million size data set (on a 2.8GHz Xeon):

a = np.random.randint(10**7, size=10**7)

The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.
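The two Counter variants named in the timings but not shown above can be sketched roughly as follows; the exact code from those answers isn't reproduced here, so this is a plausible reconstruction:

```python
from collections import Counter

a = [1, 2, 1, 3, 3, 3, 0]

# Counter.most_common variant: build and scan the full frequency table
dups_mc = [x for x, c in Counter(a).most_common() if c > 1]  # [3, 1]

# Counter-subtraction variant: removing one occurrence of every distinct
# value leaves a positive count only for values that appeared more than once
dups_sub = list(Counter(a) - Counter(set(a)))  # [1, 3]
```

Both do the same work as the accepted Counter comprehension but add overhead (sorting the table by count, or building and subtracting a second Counter), which is consistent with their slower timings.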

I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.

Solution

I think this is most clearly done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]

note: This is similar to Burhan Khalid's answer, but the use of items (iteritems on Python 2) without subscripting in the condition should be faster.
