Fastest way to find all unique elements in an array with Cython


Question

I am attempting to find the most performant method to find unique values from a NumPy array. NumPy's unique function is very slow and sorts the values before finding the unique ones. Pandas hashes the values using the klib C library, which is much faster. I am looking for a Cython solution.

The simplest solution seems to be to just iterate through the array and use a Python set to add each element, like this:

cimport cython
cimport numpy as np
from numpy cimport ndarray
from cpython cimport set

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s

I also tried an unordered_set from C++:

from libcpp.unordered_set cimport unordered_set
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[int] s
    for i in range(n):
        s.insert(a[i])
    return s

Performance

# create array of 1,000,000
a = np.random.randint(0, 50, 1000000)

# Pure Python
%timeit set(a)
86.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Convert to list first
a_list = a.tolist()
%timeit set(a_list)
10.2 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# NumPy
%timeit np.unique(a)
32 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Pandas
%timeit pd.unique(a)
5.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Cython
%timeit unique_cython_int(a)
13.4 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Cython - c++ unordered_set
%timeit unique_cpp_int(a)
17.8 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Discussion

So pandas is about 2.5x faster than a cythonized set. Its lead increases when there are more distinct elements. Surprisingly, a pure Python set (built from a list) beats a cythonized set.

My question here - is there a faster way to do this in Cython than just using the add method repeatedly? And could the C++ unordered_set be improved?

The story changes when we use unicode strings. I believe I have to convert the numpy array to an object data type so that Cython can type it properly.

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s

And again I tried an unordered_set from C++:

from libcpp.string cimport string

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[string] s
    for i in range(n):
        s.insert(a[i])
    return s

Performance

Create an array of 1 million strings with 1,000 distinct values

s_1000 = []
for i in range(1000):
    s = np.random.choice(list('abcdef'), np.random.randint(5, 50))
    s_1000.append(''.join(s))

s_all = np.random.choice(s_1000, 1000000)

# s_all has numpy unicode as its data type. Must convert to object
s_unicode_obj = s_all.astype('O')

# c++ does not easily handle unicode. Convert to bytes and then to object
s_bytes_obj = s_all.astype('S').astype('O')

# Pure Python
%timeit set(s_all)
451 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

%timeit set(s_unicode_obj)
71.9 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# using set on a list
s_list = s_all.tolist()
%timeit set(s_list)
63.1 ms ± 7.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# NumPy
%timeit np.unique(s_unicode_obj)
1.69 s ± 97.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.unique(s_all)
633 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Pandas
%timeit pd.unique(s_unicode_obj)
97.6 ms ± 6.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Cython
%timeit unique_cython_str(s_unicode_obj)
60 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Cython - c++ unordered_set
%timeit unique_cpp_str(s_bytes_obj)
247 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Discussion

So, it appears that Python's set outperforms pandas for unicode strings but not for integers. And again, iterating through the array in Cython doesn't really help us at all.

It's possible to circumvent sets if you know the range of your integers isn't too crazy. You can simply create a second array of all zeros/False, flip a position to True the first time you encounter its value, and append that number to a list. This is extremely fast since no hashing is done.

The following works for positive integer arrays. If you had negative integers, you would have to add a constant to shift the numbers up to 0 (a sketch of that variant follows below).

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    # the flag array must have at least max(a) + 1 entries; here n is far
    # larger than the biggest value (49), so allocating n flags is enough
    cdef ndarray[np.uint8_t, cast=True] unique = np.zeros(n, dtype=bool)
    cdef list result = []
    for i in range(n):
        if not unique[a[i]]:
            unique[a[i]] = True
            result.append(a[i])
    return result

%timeit unique_bounded(a)
1.18 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The downside is of course memory usage, since your largest integer could force an extremely large array. But this method could also work for floats if you knew precisely how many significant digits each number had.
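For the negative-integer case mentioned above, a minimal sketch (the function name is mine; it assumes max(a) - min(a) is small enough that the flag array fits comfortably in memory):

cimport cython
cimport numpy as np
import numpy as np
from numpy cimport ndarray

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded_signed(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    cdef np.int64_t lo = a.min()            # shift so the smallest value maps to index 0
    cdef np.int64_t span = a.max() - lo + 1
    cdef ndarray[np.uint8_t, cast=True] seen = np.zeros(span, dtype=bool)
    cdef list result = []
    for i in range(n):
        if not seen[a[i] - lo]:
            seen[a[i] - lo] = True
            result.append(a[i])
    return result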

Integers, 50 unique of 1,000,000 total

  • Pandas - 5 ms
  • Python set of a list - 10 ms
  • Cython set - 13 ms
  • "Cheating" with integers - 1.2 ms

Strings, 1,000 unique of 1,000,000 total

  • Cython set - 60 ms
  • Python set of a list - 63 ms
  • Pandas - 98 ms

Appreciate all the help making these faster.

Answer

I think the answer to your question "what is the fastest way to find unique elements" is "it depends". It depends on your data set and on your hardware.

For your scenarios (I mostly looked at the integer case) pandas (and the khash it uses) does a pretty decent job. I was not able to match this performance using std::unordered_map.

However, google::dense_hash_set was slightly faster in my experiments than the pandas solution.

Please read on for a more detailed explanation.

I would like to start out by explaining the results you are observing and use these insights later on.

I start with your int example: there are only 50 unique elements but 1,000,000 values in the array:

import numpy as np
import pandas as pd
a=np.random.randint(0,50, 10**6, dtype=np.int64)

As a baseline, the timings of np.unique() and pd.unique() on my machine:

%timeit np.unique(a)
>>>82.3 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(a)
>>>9.4 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The pandas approach with the set (O(n)) is about 10 times faster than numpy's approach with sorting (O(n log n)). log n = 20 for n = 10**6, so the factor of 10 is about the expected difference.

Another difference is that np.unique returns a sorted array, so one can use binary search to look up elements. pd.unique returns an unsorted array, so we need either to sort it (which might be O(n log n) if there are not many duplicates in the original data) or to transform it into a set-like structure.
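A small illustration of that difference, reusing the array a and the imports from above (the looked-up value 42 is arbitrary):

u_sorted = np.unique(a)                  # sorted, so binary search works
i = np.searchsorted(u_sorted, 42)
found_np = i < len(u_sorted) and u_sorted[i] == 42

u_set = set(pd.unique(a))                # one hashing pass, then O(1) membership tests
found_pd = 42 in u_set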

Let's take a look at the simple Python-Set:

%timeit set(a)
>>> 257 ms ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first thing we must be aware of here: we are comparing apples and oranges. The previous unique functions return numpy arrays, which consist of lowly C integers. This one returns a set of full-fledged Python integers. Quite a different thing!

That means for every element in the numpy array we must first create a Python object - quite an overhead - and only then can we add it to the set.

The conversion to Python integers can be done in a preprocessing step - your version with list:

A=list(a)
%timeit set(A)
>>> 104 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit set(list(a))
>>> 270 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

More than 100 ms are needed for the creation of the Python integers. However, Python integers are more complex than the lowly C ints and thus handling them costs more. Using pd.unique on the C ints and then promoting the result to a Python set is much faster.
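In code this is just a one-liner (using the array a from above); the hashing happens on C-level ints inside pandas, and only the roughly 50 distinct values are promoted to Python integers:

unique_set = set(pd.unique(a))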

And now your Cython version:

%timeit unique_cython_int(a)
31.3 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That I don't really understand. I would expect it to perform similarly to set(a) - Cython would cut out the interpreter, but that would not explain the factor of 10. However, we have only 50 different integers (which are even in the small-integer pool, because they are smaller than 256), so there is probably some optimization which plays a role here.

Let's try another data-set (there are now 10**5 different numbers):

b=np.random.randint(0, 10**5,10**6, dtype=np.int64)
%timeit unique_cython_int(b)
>>> 236 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set(b)
>>> 388 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would expect a speed-up of less than 2.

Let's take a look at the cpp-version:

%timeit unique_cpp_int(a)
>>> 25.4 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit unique_cpp_int(b)
>>> 100 ms ± 4.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

There is some overhead in copying the data from the cpp set to the Python set (as DavidW has pointed out), but otherwise the behavior is as I would expect given my experience with it: std::unordered_map is somewhat faster than Python, but not the greatest implementation around - pandas seems to beat it:

%timeit set(pd.unique(b))
>>> 45.8 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
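Regarding the copy overhead DavidW pointed out: one way to sidestep building a Python set of Python integers is to write the uniques straight into a typed NumPy array instead. This is only a sketch of the idea (the function name is mine, not part of the benchmarks above):

%%cython -+ -c=-std=c++11
cimport cython
cimport numpy as np
import numpy as np
from numpy cimport ndarray
from libcpp.unordered_set cimport unordered_set

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int_array(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    cdef unordered_set[np.int64_t] s
    cdef np.int64_t x
    for i in range(n):
        s.insert(a[i])
    # copy the uniques into a typed NumPy array - no Python integers are created
    cdef ndarray[np.int64_t] out = np.empty(s.size(), dtype=np.int64)
    cdef int cnt = 0
    for x in s:
        out[cnt] = x
        cnt += 1
    return out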

So it looks like, in the situation where there are many duplicates and the hash function is cheap, the pandas solution is hard to beat.

Maybe one could give Google's data structures a try.

However, when the data has only very few duplicates, numpy's sorting solution may become the faster one. The main reason is that numpy's unique needs only twice the memory - the original data and the output - while the pandas hash-set solution needs much more memory: the original data, the set and the output. For huge datasets it might become the difference between having enough RAM and not having enough RAM.

How much memory overhead is needed depends on the set implementation, and it is always about the trade-off between memory and speed. For example, std::unordered_set needs at least 32 bytes to save an 8-byte integer. Some of Google's data structures can do better.

Running /usr/bin/time -fpeak_used_memory:%M python check_mem.py with pandas/numpy unique:

#check_mem.py
import numpy as np
import pandas as pd
c=np.random.randint(0, 2**63,5*10**7, dtype=np.int64)
#pd.unique(c)  
np.unique(c)

shows 1.2 GB for numpy and 2.0 GB for pandas.

Actually, on my Windows machine np.unique is faster than pd.unique if there are (next to) only unique elements in the array, even for "only" 10^6 elements (probably because of the rehashes needed as the used set grows). This is however not the case on my Linux machine.

Another scenario in which pandas doesn't shine is when the calculation of the hash function is not cheap: consider long strings (let's say of 1000 characters) as objects.

To calculate the hash value one needs to consider all 1000 characters (which means a lot of data -> a lot of hash misses), while the comparison of two strings is mostly decided after one or two characters - the probability is then already very high that the strings are different. So the log n factor of numpy's unique doesn't look that bad anymore.

It could be better to use a tree-set instead of a hash-set in this case.
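A minimal sketch of that idea with std::set (a red-black tree), assuming the byte-string input from the question (s_bytes_obj); the function name is mine:

%%cython -+ -c=-std=c++11
cimport cython
from numpy cimport ndarray
from libcpp.set cimport set as cpp_set
from libcpp.string cimport string

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_tree_str(ndarray[object] a):
    # std::set never hashes: the ordering comparisons usually stop after the
    # first differing character, so long strings are rarely scanned in full
    cdef int i, n = len(a)
    cdef cpp_set[string] s
    for i in range(n):
        s.insert(a[i])
    return s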

Improving cpp's unordered set:

The approach using cpp's unordered set could be improved thanks to its method reserve(), which would eliminate the need for rehashing. But it is not declared in Cython's bundled libcpp.unordered_set, so using it from Cython is quite cumbersome.
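For what it is worth, newer Cython releases do declare reserve() on libcpp.unordered_set (this is an assumption about the Cython version in use; with an older release one would have to write an own extern declaration). With it, the idea would look roughly like this:

%%cython -+ -c=-std=c++11
cimport cython
cimport numpy as np
from numpy cimport ndarray
from libcpp.unordered_set cimport unordered_set

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int_reserved(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[int] s
    s.reserve(n)   # pre-size the hash table so no rehashing happens during the inserts
    for i in range(n):
        s.insert(a[i])
    return s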

Reserving, however, would not have any impact on the runtimes for the data with only 50 unique elements, and at most a factor of 2 (amortized costs due to the resize strategy used) for the data with almost all elements unique.

The hash-function for ints is identity (at least for gcc), so not much to gain here (I don't think using a more fancy hash-function would help here).

I see no way how cpp's unordered set could be tweaked to beat the khash implementation used by pandas, which seems to be quite good for this type of task.

Here are, for example, some pretty old benchmarks, which show that khash is somewhat faster than std::unordered_map, with only google_dense being even faster.

Using Google's dense map:

In my experiments, Google's dense map (from here) was able to beat khash - the benchmark code can be found at the end of the answer.

It was faster if there were only 50 unique elements:

#50 unique elements:
%timeit google_unique(a,r)
1.85 ms ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.unique(a)
3.52 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

but also faster if there were only unique elements:

%timeit google_unique(c,r)
54.4 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(c)
75.4 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

My few experiments have also shown that google_hash_set maybe uses more memory (up to 20%) than khash, but more tests are needed to see whether this is really the case.

I'm not sure my answer helped you at all. My take-aways are:

  1. If we need a set of Python integers, set(pd.unique(...)) seems to be a good starting point.
  2. There are some cases for which numpy's sorting solution might be better (less memory, sometimes the hash calculation is too expensive).
  3. Knowing more about the data can be used to tweak the solution, by making a better trade-off (e.g. using less/more memory, preallocating so we don't need to rehash, or using a bitset for look-up).
  4. The pandas solution seems to be tweaked pretty well for some usual cases, but for other cases another trade-off might be better - google_dense being the most promising candidate.


Listings for google-tests:

#google_hash.cpp
#include <cstdint>
#include <functional>
#include <sparsehash/dense_hash_set>

typedef int64_t lli;
void cpp_unique(lli *input, int n, lli *output){

  google::dense_hash_set<lli, std::hash<lli> > set;
  set.set_empty_key(-1);   // dense_hash_set needs a sentinel key that never occurs in the data
  for (int i=0;i<n;i++){
     set.insert(input[i]);
  }

  // write the unique values into output; the caller must provide enough room
  int cnt=0;
  for(auto x : set)
    output[cnt++]=x;
}

The corresponding pyx file:

#google.pyx
cimport numpy as np
cdef extern from "google_hash.cpp":
    void cpp_unique(np.int64_t  *inp, int n, np.int64_t *output)

#out should have enough memory:
def google_unique(np.ndarray[np.int64_t,ndim=1] inp, np.ndarray[np.int64_t,ndim=1] out):
    cpp_unique(&inp[0], len(inp), &out[0])

the setup.py-file:

from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize(Extension(
            name='google',
            language='c++',
            extra_compile_args=['-std=c++11'],
            sources = ["google.pyx"],
            include_dirs=[np.get_include()]
    )))

IPython benchmark script, after calling python setup.py build_ext --inplace:

import numpy as np
import pandas as pd
from google import google_unique

a=np.random.randint(0,50,10**6,dtype=np.int64)
b=np.random.randint(0, 10**5,10**6, dtype=np.int64)
c=np.random.randint(0, 2**63,10**6, dtype=np.int64)
r=np.zeros((10**6,), dtype=np.int64)

%timeit google_unique(a,r)
%timeit pd.unique(a)


Other listings

Cython version after fixes:

%%cython
cimport cython
from numpy cimport ndarray
from cpython cimport set
cimport numpy as np
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s

C++ version after fixes:

%%cython -+ -c=-std=c++11
cimport cython
cimport numpy as np
from numpy cimport ndarray
from libcpp.unordered_set cimport unordered_set
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[int] s
    for i in range(n):
        s.insert(a[i])
    return s
