What is the internal implementation of pandas Series.map()?
Question
I noticed that pandas Series.map() is extremely fast for dict mapping.
Prepare the data as follows:
import numpy as np
import pandas as pd
a=np.random.randint(0,1000,10**5)
s=pd.Series(a)
d=dict(zip(np.arange(1000),np.random.random(1000)))
Timing:
%timeit -n10 s.map(d)
%timeit -n10 np.vectorize(d.get)(a)
gives
1.42 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
where the second approach is the typical recommendation for dict mapping in NumPy that I found on Stack Overflow.
Another typical NumPy solution is the following:
%%timeit -n10
b = np.copy(a)
for k, v in d.items():
    b[a==k] = v
gives
43.9 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is even slower and, worse, it gives the wrong result. Because b has an integer dtype, the assignment b[a==k] = v truncates every float v (all of which lie in [0, 1)) to 0, so b ends up all zeros!
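To make the truncation concrete, here is a small sketch of the loop above next to a variant that writes into a float array. The names b_int and b_float and the smaller array size are my own choices for illustration, and this only fixes correctness, not speed.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 10**4)              # smaller than in the question, for a quick run
d = dict(zip(range(1000), rng.random(1000)))  # values are floats in [0, 1)

# Original approach: b inherits the integer dtype of a,
# so every float written into it is truncated to 0.
b_int = np.copy(a)
for k, v in d.items():
    b_int[a == k] = v

# Variant with a float output array: values are stored as-is,
# and entries with no matching key would stay NaN.
b_float = np.full(a.shape, np.nan)
for k, v in d.items():
    b_float[a == k] = v
```

The float variant returns the expected values but still loops once per key, so it should not be expected to get anywhere near Series.map().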
So I am wondering: what is the internal implementation of pandas Series.map()? Is it implemented in NumPy? What is the NumPy equivalent of Series.map() with the same performance? I tried to dig into the source code of Series.map() but could not understand it.
Recommended answer
Series.map will call _map_values(), which is part of pandas/core/base.py.
You're using a dict, so you go through the first if is_dict_like(mapper): clause to get the mapper, and then on lines 1161-1162 you get the mapping function for this basic case (non-extension type with the default na_action=None):
else:
    map_f = lib.map_infer
If you then go to that part of the code, found in pandas/_libs/lib.pyx, you'll see that map_infer is implemented in Cython.
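As a rough mental model only, an element-wise "map, then infer the dtype" routine can be sketched in pure Python as below. The function map_infer_sketch is a hypothetical stand-in written for this answer, not the actual Cython code, and it uses the public Series.infer_objects() for the inference step.

```python
import numpy as np
import pandas as pd

def map_infer_sketch(values, func):
    """Hypothetical pure-Python analogue: apply func element-wise, then infer a dtype."""
    out = np.empty(len(values), dtype=object)
    for i, x in enumerate(values):
        out[i] = func(x)               # one Python-level call per element
    # Let pandas pick a tighter dtype than object when all results allow it.
    return pd.Series(out).infer_objects().to_numpy()

halved = map_infer_sketch(np.array([1, 2, 3], dtype=object), lambda x: x * 0.5)
```

A compiled loop avoids much of the per-element overhead of this Python version, but it still has to call the mapping function once per element.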
As they note in the comments, this is only so fast for specific inputs:
# we can fastpath dict/Series to an efficient map
# as we know that we are not going to have to yield
# python types
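For the NumPy-equivalent part of the question, one way to read that comment is that a dict or Series can be treated as an index plus an array of values, so the mapping becomes an index lookup followed by an array take. The sketch below is my own approximation of that idea using the public Index.get_indexer; the helper name map_with_index is hypothetical, and I am assuming numeric values so that missing keys can become NaN.

```python
import numpy as np
import pandas as pd

def map_with_index(a, d):
    """Hypothetical helper: map array a through dict d via an index lookup."""
    keys = pd.Index(list(d.keys()))
    vals = np.fromiter(d.values(), dtype=float, count=len(d))
    indexer = keys.get_indexer(a)      # position of each element in keys, -1 if absent
    # vals[indexer] picks the last value for -1, so mask those positions with NaN.
    return np.where(indexer == -1, np.nan, vals[indexer])

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 10**5)
d = dict(zip(range(1000), rng.random(1000)))

result = map_with_index(a, d)
```

When the keys happen to be exactly 0..n-1, as in the question, plain fancy indexing into the value array (vals[a]) does the same job without building an index at all.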