Pandas pd.Series.isin的性能与集合与数组的关系 [英] Pandas pd.Series.isin performance with set versus array

查看：76 发布时间：2020/5/18 18:40:54 python performance pandas numpy series

本文介绍了Pandas pd.Series.isin的性能与集合与数组的关系的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

通常在Python中，最好通过set测试可散列集合的成员资格.我们之所以知道这一点，是因为使用散列可以为list或np.ndarray提供O(1)查找复杂性，而不是O(n).

在熊猫中，我经常不得不检查非常大的收藏中的会员资格.我假设同样适用，即检查系列中的每个项目是否属于set中的成员资格比使用list或np.ndarray更有效.但是，情况似乎并非如此:

import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(100000)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

%timeit ser.isin(x_set)                   # 8.9 ms
%timeit ser.isin(x_arr)                   # 2.17 ms
%timeit ser.isin(x_list)                  # 7.79 ms
%timeit np.in1d(arr, x_arr)               # 5.02 ms
%timeit [i in x_set for i in lst]         # 1.1 ms
%timeit [i in x_set for i in ser.values]  # 4.61 ms

用于测试的版本:

np.__version__  # '1.14.3'
pd.__version__  # '0.23.0'
sys.version     # '3.6.5'

pd.Series.isin 使用 numpy.in1d ，大概意味着set到np.ndarray转换的开销很大.

否定构建投入的成本，对熊猫的影响:

如果您知道x_list或x_arr的元素是唯一的，请不要费心转换为x_set.与Pandas一起使用，这将是昂贵的(转换和成员资格测试).
使用列表推导是从O(1)集查找中受益的唯一方法.

我的问题是:

我的上述分析正确吗?这似乎是pd.Series.isin的实现方式的一个显而易见但尚未记录的结果.
有没有使用列表理解或pd.Series.apply的变通办法，而做是使用O(1)设置查找的吗?还是以NumPy作为熊猫的骨干是不可避免的设计选择和/或必然结果?

更新:在较旧的设置(熊猫/NumPy版本)上，我看到x_set的表现优于x_arr和pd.Series.isin.因此，还有一个问题:是否有任何东西从旧变新，以致导致set的性能变差?

%timeit ser.isin(x_set)                   # 10.5 ms
%timeit ser.isin(x_arr)                   # 15.2 ms
%timeit ser.isin(x_list)                  # 9.61 ms
%timeit np.in1d(arr, x_arr)               # 4.15 ms
%timeit [i in x_set for i in lst]         # 1.15 ms
%timeit [i in x_set for i in ser.values]  # 2.8 ms

pd.__version__  # '0.19.2'
np.__version__  # '1.11.3'
sys.version     # '3.6.0'

解决方案

这可能并不明显，但是pd.Series.isin使用O(1)-查找每个元素.

经过分析，证明了上面的陈述，我们将利用其洞察力创建一个Cython原型，该原型可以轻松击败最快的即用型解决方案.

让我们假设集合"具有n元素，而系列"具有m元素.然后运行时间为:

 T(n,m)=T_preprocess(n)+m*T_lookup(n)

对于纯python版本，这意味着:

T_preprocess(n)=0-无需预处理
T_lookup(n)=O(1)-python集合的众所周知行为
得出T(n,m)=O(m)

pd.Series.isin(x_arr)会发生什么?显然，如果跳过预处理并在线性时间内搜索，则会得到O(n*m)，这是不可接受的.

在调试器或探查器(我使用valgrind-callgrind + kcachegrind)的帮助下很容易看出来，发生了什么:工作马是函数__pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64.可以在此处找到它的定义>:

在预处理步骤中，从klib中创建了一个哈希图(熊猫使用喀什) x_arr中的n个元素，即在运行时间O(n)中.

O(1)

O(m)

m个查找.
得出T(n,m)=O(m)+O(n)

我们必须记住-numpy-array的元素是原始C-整数，而不是原始集合中的Python对象-因此我们不能按原样使用集合.

将一组Python对象转换为一组C-int的替代方法是将单个C-int转换为Python对象，从而能够使用原始的C-int.这就是[i in x_set for i in ser.values] -variant:

中发生的情况

不进行预处理.
m次查找都在O(1)时间或总计O(m)中进行，但是由于必须创建Python对象，因此查找速度较慢.
得出T(n,m)=O(m)

很显然，您可以使用Cython加快该版本的速度.

但是有足够的理论，让我们看一下固定m的不同n的运行时间:

我们可以看到:对于大的n，预处理的线性时间主导着numpy版本.从numpy转换为纯python(numpy->python)的版本与纯python版本具有相同的恒定行为，但由于必须进行转换而速度较慢-所有这些都符合我们的分析.

在图中不能很好地看到:如果n < m numpy版本变得更快-在这种情况下，khash -lib的快速查找将扮演最重要的角色，而不是预处理部分.

我从这项分析中得出的结论:

n < m:pd.Series.isin应该被采用，因为O(n)-预处理的成本并不高.
n > m :(应该是cythonized版本的)[i in x_set for i in ser.values]，因此应避免使用O(n).
显然有一个灰色区域，其中n和m大致相等，如果不进行测试，很难确定哪种解决方案是最佳的.

如果您可以控制它:最好的方法是直接将set作为C整数集(khash( Cython-wrapper for khash (受熊猫包装程序的启发) )，可以通过pip install https://github.com/realead/cykhash/zipball/master安装，然后与Cython配合使用，以实现更快的isin版本:

%%cython
import numpy as np
cimport numpy as np

from cykhash.khashsets cimport Int64Set

def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

作为另外一种可能，可以包装c ++的unordered_map(请参见清单C)，这具有需要c ++库的缺点，并且(我们将看到)速度稍慢.

比较方法(请参见清单D中的计时创建方法):

khash比numpy->python快20倍，比纯python快6倍(但是，无论如何，纯python并不是我们想要的)，甚至比cpp的版本快3倍. >

列表

1)使用valgrind进行分析:

#isin.py
import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(2*10**6)}
x_arr = np.array(list(x_set))


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)


for _ in range(10):
   ser.isin(x_arr)

现在:

>>> valgrind --tool=callgrind python isin.py
>>> kcachegrind

导致以下调用图:

B:用于生成运行时间的ipython代码:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)

    t1=%timeit -o  ser.isin(x_arr) 
    t2=%timeit -o  [i in x_set for i in lst]
    t3=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average])
    n*=2

#plotting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
plt.plot(for_plot[:,0], for_plot[:,2], label='python')
plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
plt.legend()
plt.show()

C:cpp包装器:

%%cython --cplus -c=-std=c++11 -a

from libcpp.unordered_set cimport unordered_set

cdef class HashSet:
    cdef unordered_set[long long int] s
    cpdef add(self, long long int z):
        self.s.insert(z)
    cpdef bint contains(self, long long int z):
        return self.s.count(z)>0

import numpy as np
cimport numpy as np

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)

def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

D:使用不同的包装器绘制结果:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from cykhash import Int64Set

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    cpp_set=HashSet()
    khash_set=Int64Set()

    for i in x_set:
        cpp_set.add(i)
        khash_set.add(i)


    assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())
    assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())


    t1=%timeit -o  isin_khash(ser.values, khash_set)
    t2=%timeit -o  isin_cpp(ser.values, cpp_set) 
    t3=%timeit -o  [i in x_set for i in lst]
    t4=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average, t4.average])
    n*=2

#ploting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()

In Python generally, membership of a hashable collection is best tested via set. We know this because the use of hashing gives us O(1) lookup complexity versus O(n) for list or np.ndarray.

In Pandas, I often have to check for membership in very large collections. I presumed that the same would apply, i.e. checking each item of a series for membership in a set is more efficient than using list or np.ndarray. However, this doesn't seem to be the case:

import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(100000)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

%timeit ser.isin(x_set)                   # 8.9 ms
%timeit ser.isin(x_arr)                   # 2.17 ms
%timeit ser.isin(x_list)                  # 7.79 ms
%timeit np.in1d(arr, x_arr)               # 5.02 ms
%timeit [i in x_set for i in lst]         # 1.1 ms
%timeit [i in x_set for i in ser.values]  # 4.61 ms

Versions used for testing:

np.__version__  # '1.14.3'
pd.__version__  # '0.23.0'
sys.version     # '3.6.5'

The source code for pd.Series.isin, I believe, utilises numpy.in1d, which presumably means a large overhead for set to np.ndarray conversion.

Negating the cost of constructing the inputs, the implications for Pandas:

If you know your elements of x_list or x_arr are unique, don't bother converting to x_set. This will be costly (both conversion and membership tests) for use with Pandas.
Using list comprehensions are the only way to benefit from O(1) set lookup.

My questions are:

Is my analysis above correct? This seems like an obvious, yet undocumented, result of how pd.Series.isin has been implemented.
Is there a workaround, without using a list comprehension or pd.Series.apply, which does utilise O(1) set lookup? Or is this an unavoidable design choice and/or corollary of having NumPy as the backbone of Pandas?

Update: On an older setup (Pandas / NumPy versions) I see x_set outperform x_arr with pd.Series.isin. So an additional question: has anything fundamentally changed from old to new to cause performance with set to worsen?

%timeit ser.isin(x_set)                   # 10.5 ms
%timeit ser.isin(x_arr)                   # 15.2 ms
%timeit ser.isin(x_list)                  # 9.61 ms
%timeit np.in1d(arr, x_arr)               # 4.15 ms
%timeit [i in x_set for i in lst]         # 1.15 ms
%timeit [i in x_set for i in ser.values]  # 2.8 ms

pd.__version__  # '0.19.2'
np.__version__  # '1.11.3'
sys.version     # '3.6.0'

解决方案

This might not be obvious, but pd.Series.isin uses O(1)-look up per element.

After an analysis, which proves the above statement, we will use its insights to create a Cython-prototype which can easily beat the fastest out-of-the-box-solution.

Let's assume that the "set" has n elements and the "series" has m elements. The running time is then:

 T(n,m)=T_preprocess(n)+m*T_lookup(n)

For the pure-python version, that means:

T_preprocess(n)=0 - no preprocessing needed
T_lookup(n)=O(1) - well known behavior of python's set
results in T(n,m)=O(m)

What happens for pd.Series.isin(x_arr)? Obviously, if we skip the preprocessing and search in linear time we will get O(n*m), which is not acceptable.

It is easy to see with help of a debugger or a profiler (I used valgrind-callgrind+kcachegrind), what is going on: the working horse is the function __pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64. Its definition can be found here:

In a preprocessing step, a hash-map (pandas uses khash from klib) is created out of n elements from x_arr, i.e. in running time O(n).
m look-ups happen in O(1) each or O(m) in total in the constructed hash-map.
results in T(n,m)=O(m)+O(n)

We must remember - the elements of numpy-array are raw-C-integers and not the Python-objects in the original set - so we cannot use the set as it is.

An alternative to converting the set of Python-objects to a set of C-ints, would be to convert the single C-ints to Python-object and thus be able to use the original set. That is what happens in [i in x_set for i in ser.values]-variant:

No preprocessing.
m look-ups happen in O(1) time each or O(m) in total, but the look-up is slower due to necessary creation of a Python-object.
results in T(n,m)=O(m)

Clearly, you could speed-up this version a little bit by using Cython.

But enough theory, let's take a look at the running times for different ns with fixed ms:

We can see: the linear time of preprocessing dominates the numpy-version for big ns. The version with conversion from numpy to pure-python (numpy->python) has the same constant behavior as the pure-python version but is slower, because of the necessary conversion - this all in accordance with our analysis.

That cannot be seen well in the diagram: if n < m the numpy version becomes faster - in this case the faster look-up of khash-lib plays the most important role and not the preprocessing-part.

My take-aways from this analysis:

n < m: pd.Series.isin should be taken because O(n)-preprocessing isn't that costly.
n > m: (probably cythonized version of) [i in x_set for i in ser.values] should be taken and thus O(n) avoided.
clearly there is a gray zone where n and m are approximately equal and it is hard to tell which solution is best without testing.
If you have it under your control: The best thing would be to build the set directly as a C-integer-set (khash (already wrapped in pandas) or maybe even some c++-implementations), thus eliminating the need for preprocessing. I don't know, whether there is something in pandas you could reuse, but it is probably not a big deal to write the function in Cython.

The problem is that the last suggestion doesn't work out of the box, as neither pandas nor numpy have a notion of a set (at least to my limited knowledge) in their interfaces. But having raw-C-set-interfaces would be best of both worlds:

no preprocessing needed because values are already passed as a set
no conversion needed because the passed set consists of raw-C-values

I've coded a quick and dirty Cython-wrapper for khash (inspired by the wrapper in pandas), which can be installed via pip install https://github.com/realead/cykhash/zipball/master and then used with Cython for a faster isin version:

%%cython
import numpy as np
cimport numpy as np

from cykhash.khashsets cimport Int64Set

def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

As a further possibility the c++'s unordered_map can be wrapped (see listing C), which has the disadvantage of needing c++-libraries and (as we will see) is slightly slower.

Comparing the approaches (see listing D for creating of timings):

khash is about factor 20 faster than the numpy->python, about factor 6 faster than the pure python (but pure-python is not what we want anyway) and even about factor 3 faster than the cpp's-version.

Listings

1) profiling with valgrind:

#isin.py
import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(2*10**6)}
x_arr = np.array(list(x_set))


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)


for _ in range(10):
   ser.isin(x_arr)

and now:

>>> valgrind --tool=callgrind python isin.py
>>> kcachegrind

leads to the following call graph:

B: ipython code for producing the running times:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)

    t1=%timeit -o  ser.isin(x_arr) 
    t2=%timeit -o  [i in x_set for i in lst]
    t3=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average])
    n*=2

#plotting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
plt.plot(for_plot[:,0], for_plot[:,2], label='python')
plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
plt.legend()
plt.show()

C: cpp-wrapper:

%%cython --cplus -c=-std=c++11 -a

from libcpp.unordered_set cimport unordered_set

cdef class HashSet:
    cdef unordered_set[long long int] s
    cpdef add(self, long long int z):
        self.s.insert(z)
    cpdef bint contains(self, long long int z):
        return self.s.count(z)>0

import numpy as np
cimport numpy as np

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)

def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

D: plotting results with different set-wrappers:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from cykhash import Int64Set

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    cpp_set=HashSet()
    khash_set=Int64Set()

    for i in x_set:
        cpp_set.add(i)
        khash_set.add(i)


    assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())
    assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())


    t1=%timeit -o  isin_khash(ser.values, khash_set)
    t2=%timeit -o  isin_cpp(ser.values, cpp_set) 
    t3=%timeit -o  [i in x_set for i in lst]
    t4=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average, t4.average])
    n*=2

#ploting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()

这篇关于Pandas pd.Series.isin的性能与集合与数组的关系的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Pandas pd.Series.isin的性能与集合与数组的关系 [英] Pandas pd.Series.isin performance with set versus array

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录关闭

Pandas pd.Series.isin的性能与集合与数组的关系 [英] Pandas pd.Series.isin performance with set versus array

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭