What is the internal implementation of pandas Series.map()?

Question

I noticed that pandas Series.map() is extremely fast for dict mapping.

Prepare the following data:

import numpy as np
import pandas as pd

a = np.random.randint(0, 1000, 10**5)
s = pd.Series(a)
d = dict(zip(np.arange(1000), np.random.random(1000)))

Timing:

%timeit -n10 s.map(d)
%timeit -n10 np.vectorize(d.get)(a)

which gives:

1.42 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

where the second approach is the typical recommendation for dict mapping in numpy that I found on Stack Overflow.

Another typical numpy solution is the following:

%%timeit -n10 
b = np.copy(a)
for k, v in d.items():
    b[a==k] = v

which gives:

43.9 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is even slower and, what is worse, it gives the wrong result. Because b has an integer dtype, the assignment b[a==k] = v truncates each float v (all of them lie in [0, 1)) to 0, so b ends up all zeros!
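The truncation described above can be reproduced in isolation. The following is only a sketch: it uses a smaller array than the question (10**4 elements) so it runs quickly, and the names b_int and b_float are introduced here for illustration. It shows that allocating a float buffer avoids the wrong result, although it does not address the speed of the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 10**4)              # smaller than the question's 10**5
d = dict(zip(range(1000), rng.random(1000)))  # values are floats in [0, 1)

# Integer buffer (what np.copy(a) gives): each float is truncated on assignment.
b_int = np.copy(a)
for k, v in d.items():
    b_int[a == k] = v

# Float buffer: the mapped values are stored unchanged.
b_float = np.empty(a.shape, dtype=float)
for k, v in d.items():
    b_float[a == k] = v

print(b_int.dtype, b_int.max())   # integer dtype, every entry truncated to 0
print(b_float[:3])                # the actual mapped floats
```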

So I am wondering: what is the internal implementation of pandas Series.map()? Is it implemented in numpy? What is the numpy equivalent of Series.map() with the same performance? I tried to dig into the source code of Series.map() but could not understand it.

Recommended answer

Series.map calls _map_values(), which is defined in pandas/core/base.py.

You're using a dict, so you go through the first if is_dict_like(mapper): clause to get the mapper, and then on lines 1161-1162 you get the mapping function for this basic case (a non-extension type with the default na_action=None):

else:
    map_f = lib.map_infer

If you then go to that part of the code, found in pandas/_libs/lib.pyx, you'll see that map_infer is implemented in Cython.
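To give a rough idea of what an element-wise "map then infer" loop does, here is a pure-Python sketch. It is not the pandas code: map_infer_sketch is a hypothetical name, and the dtype inference is reduced to "try float, otherwise keep object", which is an assumption made purely for illustration. The compiled Cython version avoids much of the per-element interpreter overhead that this loop pays.

```python
import numpy as np

def map_infer_sketch(values, func):
    """Hypothetical stand-in: apply func to each element, then narrow the dtype."""
    out = np.empty(len(values), dtype=object)
    for i, v in enumerate(values):
        out[i] = func(v)
    # Simplified "inference": use float64 if every result converts, else keep objects.
    try:
        return out.astype(float)
    except (TypeError, ValueError):
        return out

halves = map_infer_sketch(np.array([1, 2, 3]), lambda x: x * 0.5)
labels = map_infer_sketch(np.array([1, 2, 3]), lambda x: "id-%d" % x)
print(halves.dtype, labels.dtype)
```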

As the comments in the source note, this is only so fast for specific inputs:

# we can fastpath dict/Series to an efficient map
# as we know that we are not going to have to yield
# python types
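For a numpy-flavoured equivalent, one option is to look up positions once and then gather values in bulk. The sketch below is an approximation built from public APIs (pd.Index.get_indexer and ndarray.take), not a copy of the pandas internals; the names keys, vals, indexer and mapped are introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 10**5)
d = dict(zip(range(1000), rng.random(1000)))

# Split the dict into aligned key and value arrays.
keys = pd.Index(list(d.keys()))
vals = np.fromiter(d.values(), dtype=float, count=len(d))

# Position of each element of a within keys; -1 marks a key that is absent.
indexer = keys.get_indexer(a)

# Gather all values in one vectorized step, then blank out the misses.
mapped = vals.take(indexer)
mapped[indexer == -1] = np.nan

# Compare against Series.map on the same data.
expected = pd.Series(a).map(d).to_numpy()
print(np.array_equal(mapped, expected))
```

If the keys happen to be the contiguous integers 0..n-1, as in the question, vals[a] alone would do the same job; the indexer step is what generalizes this to arbitrary hashable keys.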
