Efficiently determining if a large sorted numpy array has only unique values
Question
I have a very large numpy array, and I want to sort it and test whether all of its values are unique.
I'm aware of the function numpy.unique, but it sorts the array a second time to do its work.
I need the array sorted beforehand because the keys returned by argsort will be used to reorder another array.
I'm looking for a way to do both (the argsort and the uniqueness test) without sorting the array again.
Example code:
import numpy as np
# generate random arrays with 2 ** 27 elements (it can grow even bigger!)
# np.random.random_integers is deprecated; randint with an exclusive
# upper bound of 2 ** 32 + 1 draws from the same range, 1 to 2 ** 32
slices = np.random.randint(1, 2 ** 32 + 1, size=2 ** 27, dtype=np.int64)
values = np.random.randint(1, 2 ** 32 + 1, size=2 ** 27, dtype=np.int64)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both arrays, slices and values, have 1 row and the same (huge) number of columns.
Thanks in advance.
Recommended answer
In a sorted array, non-unique values always sit next to each other, so you can detect them by comparing the differences between adjacent elements to 0. The expression below evaluates to True when the array contains at least one duplicate:
numpy.any(numpy.diff(slices) == 0)
Be aware, though, that numpy will create two intermediate arrays: one holding the differences and one holding the boolean results.
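If those temporaries are a concern at this scale, one possible variation is sketched below. It is only a sketch, assuming the array is one-dimensional and already sorted. The helper name sorted_has_duplicates and its chunk_size parameter are hypothetical and not part of numpy. It compares each element with its right-hand neighbour using ==, which avoids the difference array, and it processes the data in chunks so the temporary boolean array stays small.

```python
import numpy as np


def sorted_has_duplicates(sorted_arr, chunk_size=1 << 20):
    """Return True if a sorted 1-D array has two equal adjacent values.

    chunk_size is a hypothetical tuning knob: it bounds the size of the
    temporary boolean array created by each comparison.
    """
    n = sorted_arr.shape[0]
    # There are n - 1 adjacent pairs; walk over them chunk by chunk.
    for start in range(0, n - 1, chunk_size):
        stop = min(start + chunk_size, n - 1)
        # Slices are views, so only the boolean result is a new array.
        left = sorted_arr[start:stop]
        right = sorted_arr[start + 1:stop + 1]
        if np.any(left == right):
            return True  # stop early at the first duplicate found
    return False


# Small demonstration with the same argsort-then-reorder pattern.
rng = np.random.default_rng(0)
slices = rng.integers(0, 1000, size=100)
values = rng.integers(0, 1000, size=100)

order = slices.argsort()
slices = slices[order]
values = values[order]

if sorted_has_duplicates(slices):
    print('not unique!')
else:
    print('it is unique!')
```

The early return means an array with a duplicate near the front is rejected after a single chunk, while a fully unique array is still scanned once from start to end.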