Pandas pd.Series.isin 性能与集合与数组 [英] Pandas pd.Series.isin performance with set versus array

查看:29
本文介绍了Pandas pd.Series.isin 性能与集合与数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

通常在 Python 中,最好通过 set 测试可散列集合的成员资格.我们知道这一点是因为散列的使用为 listnp.ndarray 提供了 O(1) 与 O(n) 的查找复杂度.

在 Pandas 中,我经常需要检查非常大的集合中的成员资格.我认为同样适用,即检查系列中的每个项目是否属于 set 比使用 listnp.ndarray 更有效>.然而,情况似乎并非如此:

将 numpy 导入为 np将熊猫导入为 pdnp.random.seed(0)x_set = {i for i in range(100000)}x_arr = np.array(list(x_set))x_list = 列表(x_set)arr = np.random.randint(0, 20000, 10000)ser = pd.Series(arr)lst = arr.tolist()%timeit ser.isin(x_set) # 8.9 毫秒%timeit ser.isin(x_arr) # 2.17 毫秒%timeit ser.isin(x_list) # 7.79 毫秒%timeit np.in1d(arr, x_arr) # 5.02 毫秒%timeit [i in x_set for i in lst] # 1.1 ms%timeit [i in x_set for i in ser.values] # 4.61 ms

用于测试的版本:

np.__version__ # '1.14.3'pd.__version__ #'0.23.0'sys.version # '3.6.5'

我们可以看到:对于大 n s,预处理的线性时间支配了 numpy 版本.从 numpy 到 pure-python 转换的版本(numpy->python)具有与 pure-python 版本相同的恒定行为,但由于必要的转换而速度较慢 - 这一切都符合我们的分析.

这在图中不能很好地看出:如果 n <;m numpy 版本变得更快——在这种情况下,khash-lib 的更快查找起到了最重要的作用,而不是预处理部分.

我从这次分析中得出的结论:

  • n :pd.Series.isin 应该被采用,因为 O(n)-预处理不是那么昂贵.

  • n >m:(可能是cythonized 版本)[i in x_set for i in ser.values] 应该被采用,从而避免O(n).p>

  • 显然有一个灰色区域,其中 nm 大致相等,如果不进行测试,很难判断哪种解决方案最好.

  • 如果您可以控制它:最好的办法是将 set 直接构建为 C 整数集 (khash (

    khash 比 numpy->python 快约 20 倍,比纯 python 快约 6 倍(但无论如何纯 python 不是我们想要的),甚至快约 3 倍比 cpp 的版本.

    <小时>

    列表

    1) 使用 valgrind 进行分析:

    #isin.py将 numpy 导入为 np将熊猫导入为 pdnp.random.seed(0)x_set = {i for i in range(2*10**6)}x_arr = np.array(list(x_set))arr = np.random.randint(0, 20000, 10000)ser = pd.Series(arr)对于 _ 范围(10):ser.isin(x_arr)

    现在:

    <预><代码>>>>valgrind --tool=callgrind python isin.py>>>缓存研磨

    导致以下调用图:

    B:生成运行时间的 ipython 代码:

    将 numpy 导入为 np将熊猫导入为 pd%matplotlib 内联导入 matplotlib.pyplot 作为 pltnp.random.seed(0)x_set = {i for i in range(10**2)}x_arr = np.array(list(x_set))x_list = 列表(x_set)arr = np.random.randint(0, 20000, 10000)ser = pd.Series(arr)lst = arr.tolist()n=10**3结果=[]而 n<3*10**6:x_set = {i for i in range(n)}x_arr = np.array(list(x_set))x_list = 列表(x_set)t1=%timeit -o ser.isin(x_arr)t2=%timeit -o [i in x_set for i in lst]t3=%timeit -o [i in x_set for i in ser.values]result.append([n, t1.average, t2.average, t3.average])n*=2#绘图结果:for_plot=np.array(结果)plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')plt.plot(for_plot[:,0], for_plot[:,2], label='python')plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')plt.xlabel('n')plt.ylabel('运行时间')plt.legend()plt.show()

    C: cpp-wrapper:

    %%cython --cplus -c=-std=c++11 -a来自 libcpp.unordered_set cimport unordered_setcdef 类 HashSet:cdef unordered_set[long long int] scpdef add(self, long long int z):self.s.insert(z)cpdef bint 包含(self,long long int z):返回 self.s.count(z)>0将 numpy 导入为 npcimport numpy 作为 npcimport cython@cython.boundscheck(假)@cython.wraparound(假)def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)cdef int i对于我在范围内(a.size):res[i]=b.contains(a[i])返回资源

    D:使用不同的 set-wrappers 绘制结果:

    将 numpy 导入为 np将熊猫导入为 pd%matplotlib 内联导入 matplotlib.pyplot 作为 plt从 cykash 导入 Int64Setnp.random.seed(0)x_set = {i for i in range(10**2)}x_arr = np.array(list(x_set))x_list = 列表(x_set)arr = np.random.randint(0, 20000, 10000)ser = pd.Series(arr)lst = arr.tolist()n=10**3结果=[]而 n<3*10**6:x_set = {i for i in range(n)}x_arr = np.array(list(x_set))cpp_set=HashSet()khash_set=Int64Set()对于 x_set 中的 i:cpp_set.add(i)khash_set.add(i)断言((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())断言((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())t1=%timeit -o isin_khash(ser.values, khash_set)t2=%timeit -o isin_cpp(ser.values, cpp_set)t3=%timeit -o [i in x_set for i in lst]t4=%timeit -o [i in x_set for i in ser.values]result.append([n, t1.average, t2.average, t3.average, t4.average])n*=2#绘图结果:for_plot=np.array(结果)plt.plot(for_plot[:,0], for_plot[:,1], label='khash')plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')plt.xlabel('n')plt.ylabel('运行时间')ymin, ymax = plt.ylim()plt.ylim(0,ymax)plt.legend()plt.show()

    In Python generally, membership of a hashable collection is best tested via set. We know this because the use of hashing gives us O(1) lookup complexity versus O(n) for list or np.ndarray.

    In Pandas, I often have to check for membership in very large collections. I presumed that the same would apply, i.e. checking each item of a series for membership in a set is more efficient than using list or np.ndarray. However, this doesn't seem to be the case:

    import numpy as np
    import pandas as pd
    
    np.random.seed(0)
    
    x_set = {i for i in range(100000)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)
    
    arr = np.random.randint(0, 20000, 10000)
    ser = pd.Series(arr)
    lst = arr.tolist()
    
    %timeit ser.isin(x_set)                   # 8.9 ms
    %timeit ser.isin(x_arr)                   # 2.17 ms
    %timeit ser.isin(x_list)                  # 7.79 ms
    %timeit np.in1d(arr, x_arr)               # 5.02 ms
    %timeit [i in x_set for i in lst]         # 1.1 ms
    %timeit [i in x_set for i in ser.values]  # 4.61 ms
    

    Versions used for testing:

    np.__version__  # '1.14.3'
    pd.__version__  # '0.23.0'
    sys.version     # '3.6.5'
    

    The source code for pd.Series.isin, I believe, utilises numpy.in1d, which presumably means a large overhead for set to np.ndarray conversion.

    Negating the cost of constructing the inputs, the implications for Pandas:

    • If you know your elements of x_list or x_arr are unique, don't bother converting to x_set. This will be costly (both conversion and membership tests) for use with Pandas.
    • Using list comprehensions are the only way to benefit from O(1) set lookup.

    My questions are:

    1. Is my analysis above correct? This seems like an obvious, yet undocumented, result of how pd.Series.isin has been implemented.
    2. Is there a workaround, without using a list comprehension or pd.Series.apply, which does utilise O(1) set lookup? Or is this an unavoidable design choice and/or corollary of having NumPy as the backbone of Pandas?

    Update: On an older setup (Pandas / NumPy versions) I see x_set outperform x_arr with pd.Series.isin. So an additional question: has anything fundamentally changed from old to new to cause performance with set to worsen?

    %timeit ser.isin(x_set)                   # 10.5 ms
    %timeit ser.isin(x_arr)                   # 15.2 ms
    %timeit ser.isin(x_list)                  # 9.61 ms
    %timeit np.in1d(arr, x_arr)               # 4.15 ms
    %timeit [i in x_set for i in lst]         # 1.15 ms
    %timeit [i in x_set for i in ser.values]  # 2.8 ms
    
    pd.__version__  # '0.19.2'
    np.__version__  # '1.11.3'
    sys.version     # '3.6.0'
    

    解决方案

    This might not be obvious, but pd.Series.isin uses O(1)-look up per element.

    After an analysis, which proves the above statement, we will use its insights to create a Cython-prototype which can easily beat the fastest out-of-the-box-solution.


    Let's assume that the "set" has n elements and the "series" has m elements. The running time is then:

     T(n,m)=T_preprocess(n)+m*T_lookup(n)
    

    For the pure-python version, that means:

    • T_preprocess(n)=0 - no preprocessing needed
    • T_lookup(n)=O(1) - well known behavior of python's set
    • results in T(n,m)=O(m)

    What happens for pd.Series.isin(x_arr)? Obviously, if we skip the preprocessing and search in linear time we will get O(n*m), which is not acceptable.

    It is easy to see with help of a debugger or a profiler (I used valgrind-callgrind+kcachegrind), what is going on: the working horse is the function __pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64. Its definition can be found here:

    • In a preprocessing step, a hash-map (pandas uses khash from klib) is created out of n elements from x_arr, i.e. in running time O(n).
    • m look-ups happen in O(1) each or O(m) in total in the constructed hash-map.
    • results in T(n,m)=O(m)+O(n)

    We must remember - the elements of numpy-array are raw-C-integers and not the Python-objects in the original set - so we cannot use the set as it is.

    An alternative to converting the set of Python-objects to a set of C-ints, would be to convert the single C-ints to Python-object and thus be able to use the original set. That is what happens in [i in x_set for i in ser.values]-variant:

    • No preprocessing.
    • m look-ups happen in O(1) time each or O(m) in total, but the look-up is slower due to necessary creation of a Python-object.
    • results in T(n,m)=O(m)

    Clearly, you could speed-up this version a little bit by using Cython.

    But enough theory, let's take a look at the running times for different ns with fixed ms:

    We can see: the linear time of preprocessing dominates the numpy-version for big ns. The version with conversion from numpy to pure-python (numpy->python) has the same constant behavior as the pure-python version but is slower, because of the necessary conversion - this all in accordance with our analysis.

    That cannot be seen well in the diagram: if n < m the numpy version becomes faster - in this case the faster look-up of khash-lib plays the most important role and not the preprocessing-part.

    My take-aways from this analysis:

    • n < m: pd.Series.isin should be taken because O(n)-preprocessing isn't that costly.

    • n > m: (probably cythonized version of) [i in x_set for i in ser.values] should be taken and thus O(n) avoided.

    • clearly there is a gray zone where n and m are approximately equal and it is hard to tell which solution is best without testing.

    • If you have it under your control: The best thing would be to build the set directly as a C-integer-set (khash (already wrapped in pandas) or maybe even some c++-implementations), thus eliminating the need for preprocessing. I don't know, whether there is something in pandas you could reuse, but it is probably not a big deal to write the function in Cython.


    The problem is that the last suggestion doesn't work out of the box, as neither pandas nor numpy have a notion of a set (at least to my limited knowledge) in their interfaces. But having raw-C-set-interfaces would be best of both worlds:

    • no preprocessing needed because values are already passed as a set
    • no conversion needed because the passed set consists of raw-C-values

    I've coded a quick and dirty Cython-wrapper for khash (inspired by the wrapper in pandas), which can be installed via pip install https://github.com/realead/cykhash/zipball/master and then used with Cython for a faster isin version:

    %%cython
    import numpy as np
    cimport numpy as np
    
    from cykhash.khashsets cimport Int64Set
    
    def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
        cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
        cdef int i
        for i in range(a.size):
            res[i]=b.contains(a[i])
        return res
    

    As a further possibility the c++'s unordered_map can be wrapped (see listing C), which has the disadvantage of needing c++-libraries and (as we will see) is slightly slower.

    Comparing the approaches (see listing D for creating of timings):

    khash is about factor 20 faster than the numpy->python, about factor 6 faster than the pure python (but pure-python is not what we want anyway) and even about factor 3 faster than the cpp's-version.


    Listings

    1) profiling with valgrind:

    #isin.py
    import numpy as np
    import pandas as pd
    
    np.random.seed(0)
    
    x_set = {i for i in range(2*10**6)}
    x_arr = np.array(list(x_set))
    
    
    arr = np.random.randint(0, 20000, 10000)
    ser = pd.Series(arr)
    
    
    for _ in range(10):
       ser.isin(x_arr)
    

    and now:

    >>> valgrind --tool=callgrind python isin.py
    >>> kcachegrind
    

    leads to the following call graph:

    B: ipython code for producing the running times:

    import numpy as np
    import pandas as pd
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    np.random.seed(0)
    
    x_set = {i for i in range(10**2)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)
    
    arr = np.random.randint(0, 20000, 10000)
    ser = pd.Series(arr)
    lst = arr.tolist()
    
    n=10**3
    result=[]
    while n<3*10**6:
        x_set = {i for i in range(n)}
        x_arr = np.array(list(x_set))
        x_list = list(x_set)
    
        t1=%timeit -o  ser.isin(x_arr) 
        t2=%timeit -o  [i in x_set for i in lst]
        t3=%timeit -o  [i in x_set for i in ser.values]
    
        result.append([n, t1.average, t2.average, t3.average])
        n*=2
    
    #plotting result:
    for_plot=np.array(result)
    plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
    plt.plot(for_plot[:,0], for_plot[:,2], label='python')
    plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
    plt.xlabel('n')
    plt.ylabel('running time')
    plt.legend()
    plt.show()
    

    C: cpp-wrapper:

    %%cython --cplus -c=-std=c++11 -a
    
    from libcpp.unordered_set cimport unordered_set
    
    cdef class HashSet:
        cdef unordered_set[long long int] s
        cpdef add(self, long long int z):
            self.s.insert(z)
        cpdef bint contains(self, long long int z):
            return self.s.count(z)>0
    
    import numpy as np
    cimport numpy as np
    
    cimport cython
    @cython.boundscheck(False)
    @cython.wraparound(False)
    
    def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
        cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
        cdef int i
        for i in range(a.size):
            res[i]=b.contains(a[i])
        return res
    

    D: plotting results with different set-wrappers:

    import numpy as np
    import pandas as pd
    %matplotlib inline
    import matplotlib.pyplot as plt
    from cykhash import Int64Set
    
    np.random.seed(0)
    
    x_set = {i for i in range(10**2)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)
    
    
    arr = np.random.randint(0, 20000, 10000)
    ser = pd.Series(arr)
    lst = arr.tolist()
    
    n=10**3
    result=[]
    while n<3*10**6:
        x_set = {i for i in range(n)}
        x_arr = np.array(list(x_set))
        cpp_set=HashSet()
        khash_set=Int64Set()
    
        for i in x_set:
            cpp_set.add(i)
            khash_set.add(i)
    
    
        assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())
        assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())
    
    
        t1=%timeit -o  isin_khash(ser.values, khash_set)
        t2=%timeit -o  isin_cpp(ser.values, cpp_set) 
        t3=%timeit -o  [i in x_set for i in lst]
        t4=%timeit -o  [i in x_set for i in ser.values]
    
        result.append([n, t1.average, t2.average, t3.average, t4.average])
        n*=2
    
    #ploting result:
    for_plot=np.array(result)
    plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
    plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
    plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
    plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
    plt.xlabel('n')
    plt.ylabel('running time')
    ymin, ymax = plt.ylim()
    plt.ylim(0,ymax)
    plt.legend()
    plt.show()
    

    这篇关于Pandas pd.Series.isin 性能与集合与数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆