Pandas pd.Series.isin的性能与集合与数组的关系 [英] Pandas pd.Series.isin performance with set versus array
问题描述
通常在Python中,最好通过set
测试可散列集合的成员资格.我们之所以知道这一点,是因为使用散列可以为list
或np.ndarray
提供O(1)查找复杂性,而不是O(n).
在熊猫中,我经常不得不检查非常大的收藏中的会员资格.我假设同样适用,即检查系列中的每个项目是否属于set
中的成员资格比使用list
或np.ndarray
更有效.但是,情况似乎并非如此:
import numpy as np
import pandas as pd
np.random.seed(0)
x_set = {i for i in range(100000)}
x_arr = np.array(list(x_set))
x_list = list(x_set)
arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()
%timeit ser.isin(x_set) # 8.9 ms
%timeit ser.isin(x_arr) # 2.17 ms
%timeit ser.isin(x_list) # 7.79 ms
%timeit np.in1d(arr, x_arr) # 5.02 ms
%timeit [i in x_set for i in lst] # 1.1 ms
%timeit [i in x_set for i in ser.values] # 4.61 ms
用于测试的版本:
np.__version__ # '1.14.3'
pd.__version__ # '0.23.0'
sys.version # '3.6.5'
pd.Series.isin
使用 numpy.in1d
,大概意味着set
到np.ndarray
转换的开销很大.
否定构建投入的成本,对熊猫的影响:
- 如果您知道
x_list
或x_arr
的元素是唯一的,请不要费心转换为x_set
.与Pandas一起使用,这将是昂贵的(转换和成员资格测试). - 使用列表推导是从O(1)集查找中受益的唯一方法.
我的问题是:
- 我的上述分析正确吗?这似乎是
pd.Series.isin
的实现方式的一个显而易见但尚未记录的结果. - 有没有使用列表理解或
pd.Series.apply
的变通办法,而做是使用O(1)设置查找的吗?还是以NumPy作为熊猫的骨干是不可避免的设计选择和/或必然结果?
更新:在较旧的设置(熊猫/NumPy版本)上,我看到x_set
的表现优于x_arr
和pd.Series.isin
.因此,还有一个问题:是否有任何东西从旧变新,以致导致set
的性能变差?
%timeit ser.isin(x_set) # 10.5 ms
%timeit ser.isin(x_arr) # 15.2 ms
%timeit ser.isin(x_list) # 9.61 ms
%timeit np.in1d(arr, x_arr) # 4.15 ms
%timeit [i in x_set for i in lst] # 1.15 ms
%timeit [i in x_set for i in ser.values] # 2.8 ms
pd.__version__ # '0.19.2'
np.__version__ # '1.11.3'
sys.version # '3.6.0'
这可能并不明显,但是pd.Series.isin
使用O(1)
-查找每个元素.
经过分析,证明了上面的陈述,我们将利用其洞察力创建一个Cython原型,该原型可以轻松击败最快的即用型解决方案.
让我们假设集合"具有n
元素,而系列"具有m
元素.然后运行时间为:
T(n,m)=T_preprocess(n)+m*T_lookup(n)
对于纯python版本,这意味着:
-
T_preprocess(n)=0
-无需预处理 -
T_lookup(n)=O(1)
-python集合的众所周知行为 - 得出
T(n,m)=O(m)
pd.Series.isin(x_arr)
会发生什么?显然,如果跳过预处理并在线性时间内搜索,则会得到O(n*m)
,这是不可接受的.
在调试器或探查器(我使用valgrind-callgrind + kcachegrind)的帮助下很容易看出来,发生了什么:工作马是函数__pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64
.可以在此处找到它的定义>:
- 在预处理步骤中,从klib中创建了一个哈希图(熊猫使用喀什)
x_arr
中的n
个元素,即在运行时间O(n)
中.
在构建的哈希图中,分别在 -
m
个查找. - 得出
T(n,m)=O(m)+O(n)
O(1)
个或全部O(m)
个中进行我们必须记住-numpy-array的元素是原始C-整数,而不是原始集合中的Python对象-因此我们不能按原样使用集合.
将一组Python对象转换为一组C-int的替代方法是将单个C-int转换为Python对象,从而能够使用原始的C-int.这就是[i in x_set for i in ser.values]
-variant:
- 不进行预处理. 每次
- m次查找都在
O(1)
时间或总计O(m)
中进行,但是由于必须创建Python对象,因此查找速度较慢. - 得出
T(n,m)=O(m)
很显然,您可以使用Cython加快该版本的速度.
但是有足够的理论,让我们看一下固定m
的不同n
的运行时间:
我们可以看到:对于大的n
,预处理的线性时间主导着numpy版本.从numpy转换为纯python(numpy->python
)的版本与纯python版本具有相同的恒定行为,但由于必须进行转换而速度较慢-所有这些都符合我们的分析.
在图中不能很好地看到:如果n < m
numpy版本变得更快-在这种情况下,khash
-lib的快速查找将扮演最重要的角色,而不是预处理部分.
我从这项分析中得出的结论:
-
n < m
:pd.Series.isin
应该被采用,因为O(n)
-预处理的成本并不高. -
n > m
:(应该是cythonized版本的)[i in x_set for i in ser.values]
,因此应避免使用O(n)
. -
显然有一个灰色区域,其中
n
和m
大致相等,如果不进行测试,很难确定哪种解决方案是最佳的. -
如果您可以控制它:最好的方法是直接将
set
作为C整数集(khash
( Cython-wrapper for khash (受熊猫包装程序的启发) ),可以通过pip install https://github.com/realead/cykhash/zipball/master
安装,然后与Cython配合使用,以实现更快的isin
版本:%%cython import numpy as np cimport numpy as np from cykhash.khashsets cimport Int64Set def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b): cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool) cdef int i for i in range(a.size): res[i]=b.contains(a[i]) return res
作为另外一种可能,可以包装c ++的
unordered_map
(请参见清单C),这具有需要c ++库的缺点,并且(我们将看到)速度稍慢.比较方法(请参见清单D中的计时创建方法):
khash比
numpy->python
快20倍,比纯python快6倍(但是,无论如何,纯python并不是我们想要的),甚至比cpp的版本快3倍. >
列表
1)使用valgrind进行分析:
#isin.py import numpy as np import pandas as pd np.random.seed(0) x_set = {i for i in range(2*10**6)} x_arr = np.array(list(x_set)) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) for _ in range(10): ser.isin(x_arr)
现在:
>>> valgrind --tool=callgrind python isin.py >>> kcachegrind
导致以下调用图:
B:用于生成运行时间的ipython代码:
import numpy as np import pandas as pd %matplotlib inline import matplotlib.pyplot as plt np.random.seed(0) x_set = {i for i in range(10**2)} x_arr = np.array(list(x_set)) x_list = list(x_set) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) lst = arr.tolist() n=10**3 result=[] while n<3*10**6: x_set = {i for i in range(n)} x_arr = np.array(list(x_set)) x_list = list(x_set) t1=%timeit -o ser.isin(x_arr) t2=%timeit -o [i in x_set for i in lst] t3=%timeit -o [i in x_set for i in ser.values] result.append([n, t1.average, t2.average, t3.average]) n*=2 #plotting result: for_plot=np.array(result) plt.plot(for_plot[:,0], for_plot[:,1], label='numpy') plt.plot(for_plot[:,0], for_plot[:,2], label='python') plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python') plt.xlabel('n') plt.ylabel('running time') plt.legend() plt.show()
C:cpp包装器:
%%cython --cplus -c=-std=c++11 -a from libcpp.unordered_set cimport unordered_set cdef class HashSet: cdef unordered_set[long long int] s cpdef add(self, long long int z): self.s.insert(z) cpdef bint contains(self, long long int z): return self.s.count(z)>0 import numpy as np cimport numpy as np cimport cython @cython.boundscheck(False) @cython.wraparound(False) def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b): cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool) cdef int i for i in range(a.size): res[i]=b.contains(a[i]) return res
D:使用不同的包装器绘制结果:
import numpy as np import pandas as pd %matplotlib inline import matplotlib.pyplot as plt from cykhash import Int64Set np.random.seed(0) x_set = {i for i in range(10**2)} x_arr = np.array(list(x_set)) x_list = list(x_set) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) lst = arr.tolist() n=10**3 result=[] while n<3*10**6: x_set = {i for i in range(n)} x_arr = np.array(list(x_set)) cpp_set=HashSet() khash_set=Int64Set() for i in x_set: cpp_set.add(i) khash_set.add(i) assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all()) assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all()) t1=%timeit -o isin_khash(ser.values, khash_set) t2=%timeit -o isin_cpp(ser.values, cpp_set) t3=%timeit -o [i in x_set for i in lst] t4=%timeit -o [i in x_set for i in ser.values] result.append([n, t1.average, t2.average, t3.average, t4.average]) n*=2 #ploting result: for_plot=np.array(result) plt.plot(for_plot[:,0], for_plot[:,1], label='khash') plt.plot(for_plot[:,0], for_plot[:,2], label='cpp') plt.plot(for_plot[:,0], for_plot[:,3], label='pure python') plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python') plt.xlabel('n') plt.ylabel('running time') ymin, ymax = plt.ylim() plt.ylim(0,ymax) plt.legend() plt.show()
In Python generally, membership of a hashable collection is best tested via
set
. We know this because the use of hashing gives us O(1) lookup complexity versus O(n) forlist
ornp.ndarray
.In Pandas, I often have to check for membership in very large collections. I presumed that the same would apply, i.e. checking each item of a series for membership in a
set
is more efficient than usinglist
ornp.ndarray
. However, this doesn't seem to be the case:import numpy as np import pandas as pd np.random.seed(0) x_set = {i for i in range(100000)} x_arr = np.array(list(x_set)) x_list = list(x_set) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) lst = arr.tolist() %timeit ser.isin(x_set) # 8.9 ms %timeit ser.isin(x_arr) # 2.17 ms %timeit ser.isin(x_list) # 7.79 ms %timeit np.in1d(arr, x_arr) # 5.02 ms %timeit [i in x_set for i in lst] # 1.1 ms %timeit [i in x_set for i in ser.values] # 4.61 ms
Versions used for testing:
np.__version__ # '1.14.3' pd.__version__ # '0.23.0' sys.version # '3.6.5'
The source code for
pd.Series.isin
, I believe, utilisesnumpy.in1d
, which presumably means a large overhead forset
tonp.ndarray
conversion.Negating the cost of constructing the inputs, the implications for Pandas:
- If you know your elements of
x_list
orx_arr
are unique, don't bother converting tox_set
. This will be costly (both conversion and membership tests) for use with Pandas. - Using list comprehensions are the only way to benefit from O(1) set lookup.
My questions are:
- Is my analysis above correct? This seems like an obvious, yet undocumented, result of how
pd.Series.isin
has been implemented. - Is there a workaround, without using a list comprehension or
pd.Series.apply
, which does utilise O(1) set lookup? Or is this an unavoidable design choice and/or corollary of having NumPy as the backbone of Pandas?
Update: On an older setup (Pandas / NumPy versions) I see
x_set
outperformx_arr
withpd.Series.isin
. So an additional question: has anything fundamentally changed from old to new to cause performance withset
to worsen?%timeit ser.isin(x_set) # 10.5 ms %timeit ser.isin(x_arr) # 15.2 ms %timeit ser.isin(x_list) # 9.61 ms %timeit np.in1d(arr, x_arr) # 4.15 ms %timeit [i in x_set for i in lst] # 1.15 ms %timeit [i in x_set for i in ser.values] # 2.8 ms pd.__version__ # '0.19.2' np.__version__ # '1.11.3' sys.version # '3.6.0'
解决方案This might not be obvious, but
pd.Series.isin
usesO(1)
-look up per element.After an analysis, which proves the above statement, we will use its insights to create a Cython-prototype which can easily beat the fastest out-of-the-box-solution.
Let's assume that the "set" has
n
elements and the "series" hasm
elements. The running time is then:T(n,m)=T_preprocess(n)+m*T_lookup(n)
For the pure-python version, that means:
T_preprocess(n)=0
- no preprocessing neededT_lookup(n)=O(1)
- well known behavior of python's set- results in
T(n,m)=O(m)
What happens for
pd.Series.isin(x_arr)
? Obviously, if we skip the preprocessing and search in linear time we will getO(n*m)
, which is not acceptable.It is easy to see with help of a debugger or a profiler (I used valgrind-callgrind+kcachegrind), what is going on: the working horse is the function
__pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64
. Its definition can be found here:- In a preprocessing step, a hash-map (pandas uses khash from klib) is created out of
n
elements fromx_arr
, i.e. in running timeO(n)
. m
look-ups happen inO(1)
each orO(m)
in total in the constructed hash-map.- results in
T(n,m)=O(m)+O(n)
We must remember - the elements of numpy-array are raw-C-integers and not the Python-objects in the original set - so we cannot use the set as it is.
An alternative to converting the set of Python-objects to a set of C-ints, would be to convert the single C-ints to Python-object and thus be able to use the original set. That is what happens in
[i in x_set for i in ser.values]
-variant:- No preprocessing.
- m look-ups happen in
O(1)
time each orO(m)
in total, but the look-up is slower due to necessary creation of a Python-object. - results in
T(n,m)=O(m)
Clearly, you could speed-up this version a little bit by using Cython.
But enough theory, let's take a look at the running times for different
n
s with fixedm
s:We can see: the linear time of preprocessing dominates the numpy-version for big
n
s. The version with conversion from numpy to pure-python (numpy->python
) has the same constant behavior as the pure-python version but is slower, because of the necessary conversion - this all in accordance with our analysis.That cannot be seen well in the diagram: if
n < m
the numpy version becomes faster - in this case the faster look-up ofkhash
-lib plays the most important role and not the preprocessing-part.My take-aways from this analysis:
n < m
:pd.Series.isin
should be taken becauseO(n)
-preprocessing isn't that costly.n > m
: (probably cythonized version of)[i in x_set for i in ser.values]
should be taken and thusO(n)
avoided.clearly there is a gray zone where
n
andm
are approximately equal and it is hard to tell which solution is best without testing.If you have it under your control: The best thing would be to build the
set
directly as a C-integer-set (khash
(already wrapped in pandas) or maybe even some c++-implementations), thus eliminating the need for preprocessing. I don't know, whether there is something in pandas you could reuse, but it is probably not a big deal to write the function in Cython.
The problem is that the last suggestion doesn't work out of the box, as neither pandas nor numpy have a notion of a set (at least to my limited knowledge) in their interfaces. But having raw-C-set-interfaces would be best of both worlds:
- no preprocessing needed because values are already passed as a set
- no conversion needed because the passed set consists of raw-C-values
I've coded a quick and dirty Cython-wrapper for khash (inspired by the wrapper in pandas), which can be installed via
pip install https://github.com/realead/cykhash/zipball/master
and then used with Cython for a fasterisin
version:%%cython import numpy as np cimport numpy as np from cykhash.khashsets cimport Int64Set def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b): cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool) cdef int i for i in range(a.size): res[i]=b.contains(a[i]) return res
As a further possibility the c++'s
unordered_map
can be wrapped (see listing C), which has the disadvantage of needing c++-libraries and (as we will see) is slightly slower.Comparing the approaches (see listing D for creating of timings):
khash is about factor 20 faster than the
numpy->python
, about factor 6 faster than the pure python (but pure-python is not what we want anyway) and even about factor 3 faster than the cpp's-version.
Listings
1) profiling with valgrind:
#isin.py import numpy as np import pandas as pd np.random.seed(0) x_set = {i for i in range(2*10**6)} x_arr = np.array(list(x_set)) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) for _ in range(10): ser.isin(x_arr)
and now:
>>> valgrind --tool=callgrind python isin.py >>> kcachegrind
leads to the following call graph:
B: ipython code for producing the running times:
import numpy as np import pandas as pd %matplotlib inline import matplotlib.pyplot as plt np.random.seed(0) x_set = {i for i in range(10**2)} x_arr = np.array(list(x_set)) x_list = list(x_set) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) lst = arr.tolist() n=10**3 result=[] while n<3*10**6: x_set = {i for i in range(n)} x_arr = np.array(list(x_set)) x_list = list(x_set) t1=%timeit -o ser.isin(x_arr) t2=%timeit -o [i in x_set for i in lst] t3=%timeit -o [i in x_set for i in ser.values] result.append([n, t1.average, t2.average, t3.average]) n*=2 #plotting result: for_plot=np.array(result) plt.plot(for_plot[:,0], for_plot[:,1], label='numpy') plt.plot(for_plot[:,0], for_plot[:,2], label='python') plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python') plt.xlabel('n') plt.ylabel('running time') plt.legend() plt.show()
C: cpp-wrapper:
%%cython --cplus -c=-std=c++11 -a from libcpp.unordered_set cimport unordered_set cdef class HashSet: cdef unordered_set[long long int] s cpdef add(self, long long int z): self.s.insert(z) cpdef bint contains(self, long long int z): return self.s.count(z)>0 import numpy as np cimport numpy as np cimport cython @cython.boundscheck(False) @cython.wraparound(False) def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b): cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool) cdef int i for i in range(a.size): res[i]=b.contains(a[i]) return res
D: plotting results with different set-wrappers:
import numpy as np import pandas as pd %matplotlib inline import matplotlib.pyplot as plt from cykhash import Int64Set np.random.seed(0) x_set = {i for i in range(10**2)} x_arr = np.array(list(x_set)) x_list = list(x_set) arr = np.random.randint(0, 20000, 10000) ser = pd.Series(arr) lst = arr.tolist() n=10**3 result=[] while n<3*10**6: x_set = {i for i in range(n)} x_arr = np.array(list(x_set)) cpp_set=HashSet() khash_set=Int64Set() for i in x_set: cpp_set.add(i) khash_set.add(i) assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all()) assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all()) t1=%timeit -o isin_khash(ser.values, khash_set) t2=%timeit -o isin_cpp(ser.values, cpp_set) t3=%timeit -o [i in x_set for i in lst] t4=%timeit -o [i in x_set for i in ser.values] result.append([n, t1.average, t2.average, t3.average, t4.average]) n*=2 #ploting result: for_plot=np.array(result) plt.plot(for_plot[:,0], for_plot[:,1], label='khash') plt.plot(for_plot[:,0], for_plot[:,2], label='cpp') plt.plot(for_plot[:,0], for_plot[:,3], label='pure python') plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python') plt.xlabel('n') plt.ylabel('running time') ymin, ymax = plt.ylim() plt.ylim(0,ymax) plt.legend() plt.show()
这篇关于Pandas pd.Series.isin的性能与集合与数组的关系的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- If you know your elements of