Determining duplicate values in an array
Question

Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if that is more efficient.

I've come up with a few methods that appear to work:
Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a isn't actually unique):

np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting

s = np.sort(a, axis=None)
s[s[1:] == s[:-1]]
Pandas

s = pd.Series(a)
s[s.duplicated()]
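A side note on the Pandas variant (a small sketch, assuming pandas is importable as pd): duplicated() with its default keep='first' marks each occurrence after the first, which gives the array([1, 3, 3]) form; passing keep=False instead flags every occurrence of each duplicated value:

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 1, 3, 3, 3, 0])
s = pd.Series(a)

# Default keep='first': mark occurrences after the first -> values 1, 3, 3
after_first = s[s.duplicated()]

# keep=False: mark every occurrence of a duplicated value -> 1, 1, 3, 3, 3
all_occurrences = s[s.duplicated(keep=False)]
```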
Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
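For reference, the numpy-only snippets above can be collected into one self-contained script. One caveat: on current NumPy a boolean mask must match the length of the array it indexes, so the sorting variant here indexes s[1:] rather than s.

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# Sorting: adjacent equal elements in the sorted copy are duplicates.
s = np.sort(a, axis=None)
dup_sort = s[1:][s[1:] == s[:-1]]

# Histogram: keep the unique values whose count exceeds one.
u, i = np.unique(a, return_inverse=True)
dup_hist = u[np.bincount(i) > 1]

# Masking: mark the first occurrence of each value, then invert the mask.
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
dup_mask = a[~m]

print(dup_sort)  # [1 3 3]
print(dup_hist)  # [1 3]
print(dup_mask)  # [1 3 3]
```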
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
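A minimal sketch of a timing harness for the two leading contenders (the array size is reduced from the question's 10**7 so a demo run finishes quickly; absolute numbers will differ from those below):

```python
import timeit

import numpy as np

# Smaller than the question's 10**7, to keep the demo quick.
a = np.random.randint(10**6, size=10**6)

def dup_sort(arr):
    # Adjacent equal elements of the sorted copy are the duplicates.
    s = np.sort(arr, axis=None)
    return s[1:][s[1:] == s[:-1]]

def dup_bincount(arr):
    # Unique values whose count exceeds one.
    u, i = np.unique(arr, return_inverse=True)
    return u[np.bincount(i) > 1]

for fn in (dup_sort, dup_bincount):
    t = min(timeit.repeat(lambda: fn(a), number=1, repeat=3))
    print(f"{fn.__name__}: {t:.3f}s")
```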
The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.

I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
Solution

I think this is most clearly done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).iteritems() if count > 1]
[1, 3]
Note: This is similar to Burhan Khalid's answer, but the use of iteritems without subscripting in the condition should be faster.