Why is pandas.series.map so shockingly slow?

Question

Some days I just hate using middleware. Take this for example: I'd like a lookup table that maps values from a set of input (domain) values to output (range) values. The mapping is unique. A Python dict can do this, but since the map is quite big I figured, why not use a pd.Series and its index, which has the added benefit that I can:

  • pass in multiple values to be mapped as a Series (hopefully faster than a dict lookup)
  • keep the original Series' index in the result

Like so:

# alldomainvals, allrangevals, domainvals, someindex are placeholders
domain2range = pd.Series(allrangevals, index=alldomainvals)
# Apply the map
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)
assert result.index is someindex                    # Nice
assert np.isin(result.values, allrangevals).all()   # Nice
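For concreteness, here is a minimal runnable version of the same idea; the data values are hypothetical stand-ins, since the snippet above leaves them undefined:

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the undefined names above
alldomainvals = ['a', 'b', 'c']
allrangevals = [10, 20, 30]
someindex = pd.Index([100, 101, 102])
domainvals = ['c', 'a', 'b']

domain2range = pd.Series(allrangevals, index=alldomainvals)
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)

assert result.index is someindex                    # index is preserved
assert np.isin(result.values, allrangevals).all()   # all outputs are range values
print(result)
# 100    30
# 101    10
# 102    20
# dtype: int64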

Works as expected. But not: the above .map's time cost grows with len(domain2range) rather than, more sensibly, with len(query_vals), as can be shown:

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    rng = domain + 10  # avoid shadowing the builtin `range`
    maptable = pd.Series(rng, index=domain).sort_index()

    query_vals = pd.Series([1, 2, 3])

    def f():
        query_vals.map(maptable)

    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)


10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453

facepalm. At n=10000000 that's roughly (0.016/3) ≈ 5 ms per mapped value.

So, the questions:

  • is Series.map expected to behave like this? Why is it so utterly, ridiculously slow? I think I'm using it as shown in the docs.
  • is there a fast way to do table lookups in pandas? It seems the above is not it.

Answer

https://github.com/pandas-dev/pandas/issues/21278

Warmup was the issue (double facepalm). Pandas silently builds and caches a hash index over the lookup Series' index on first use (O(maplen)). Calling the tested function once beforehand, so the index is prebuilt, gets much better performance.
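To see the cached index directly, here is a sketch of my own (not from the answer): the O(maplen) hash-index build is paid by the first lookup against maptable and reused by later ones.

import timeit

import numpy as np
import pandas as pd

n = 10000000
maptable = pd.Series(np.arange(n) + 10, index=np.arange(n))
query_vals = pd.Series([1, 2, 3])

cold = timeit.timeit(lambda: query_vals.map(maptable), number=1)  # builds the hash index
warm = timeit.timeit(lambda: query_vals.map(maptable), number=1)  # reuses the cached index
print(cold, warm)  # expect cold >> warm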

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 100000, 1000000, 10000000]:
    domain = np.arange(0, n)
    rng = domain + 10  # avoid shadowing the builtin `range`
    maptable = pd.Series(rng, index=domain)  # .sort_index()

    query_vals = pd.Series([1, 2, 3])

    def f1():
        query_vals.map(maptable)
    f1()  # warmup: prebuild the hash index before timing
    print("Pandas1 ", n, timeit.timeit(stmt=f1, number=numiter) / numiter)

    def f2():
        query_vals.map(maptable.get)
    f2()  # warmup
    print("Pandas2 ", n, timeit.timeit(stmt=f2, number=numiter) / numiter)

    maptabledict = maptable.to_dict()
    query_vals_list = pd.Series([1, 2, 3]).tolist()

    def f3():
        {k: maptabledict[k] for k in query_vals_list}
    f3()
    print("Py dict ", n, timeit.timeit(stmt=f3, number=numiter) / numiter)
    print()

pd.show_versions()
Pandas1  10 0.000621199607849
Pandas2  10 0.000686831474304
Py dict  10 2.0170211792e-05

Pandas1  100000 0.00149286031723
Pandas2  100000 0.00118808984756
Py dict  100000 8.47816467285e-06

Pandas1  1000000 0.000708899497986
Pandas2  1000000 0.000479419231415
Py dict  1000000 1.64794921875e-05

Pandas1  10000000 0.000798969268799
Pandas2  10000000 0.000410139560699
Py dict  10000000 1.47914886475e-05

... although it's a little depressing that plain Python dictionaries are 10x faster.
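One caveat, added here as a hedged aside rather than part of the original answer: the query is only three values, so these timings mostly measure pandas' fixed per-call overhead. A sketch to check how the comparison shifts once the lookup is amortized over a large query batch:

import timeit

import numpy as np
import pandas as pd

n = 1000000
maptable = pd.Series(np.arange(n) + 10, index=np.arange(n))
maptabledict = maptable.to_dict()

big_query = pd.Series(np.random.randint(0, n, size=100000))
big_query.map(maptable)  # warmup: build the hash index

t_pandas = timeit.timeit(lambda: big_query.map(maptable), number=10) / 10
t_dict = timeit.timeit(
    lambda: [maptabledict[k] for k in big_query.tolist()], number=10) / 10
print("Pandas ", t_pandas)
print("Py dict", t_dict)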
