Indexed lookup on pandas dataframe. Why so slow? How to speed up?
Question
Suppose I have a pandas series that I'd like to function as a multimap (multiple values for each index key):
import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from a where a's index matches another index b. (Like an inner join. Or a merge, but for series.)
- a may have duplicates in its index.
- b may not have duplicates and is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted index object:
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:
             target
   a           b          result
3    0         3          3    0
3    1         7          8    3
4    2         8          ...
8    3         ...
9    4
...
I'm also only interested in getting the values of the result (index [3, 8, ...] not needed).
If a did not have duplicates, we would simply do:
a.reindex(b) # Cannot reindex a duplicate axis
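As a minimal, self-contained sketch of that failure (the tiny series here is illustrative, not from the question):

```python
import pandas as pd

dup = pd.Series([0, 1, 2], index=[3, 3, 4])  # label 3 is duplicated

try:
    dup.reindex([3, 4])
except ValueError as exc:
    # Reindexing an axis with duplicate labels raises ValueError.
    print(type(exc).__name__)  # ValueError
```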
Because & maintains the duplicates of a, we can't do:
d = a[a.index & b]
d = a.loc[a.index & b] # same
d = a.get(a.index & b) # same
print(d.shape)
So I think we need to do something like:
common = (a.index & b).unique()
a.loc[common]
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas dataframe? I don't want to get an index object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3GHz machine.
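One candidate fast path worth trying (a sketch of my own, not from the question) is a boolean mask built with Index.isin, which does a single hashed membership pass and keeps a's duplicates without any intermediate unique/sort step:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is reproducible
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()

# Hashed membership test over a's index; preserves a's duplicates.
fast = a[a.index.isin(b)]

# The question's route, for comparison (intersection + .loc).
common = a.index.intersection(b).unique().sort_values()
slow = a.loc[common]

# Both select the same multiset of values, just in different orders.
assert sorted(fast.tolist()) == sorted(slow.tolist())
```

Note this returns values in a's original order rather than grouped by label, which matches the stated requirement of only wanting the values.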
Also: profiling a.loc[common] gives:
ncalls  tottime   percall   cumtime   percall    filename:lineno(function)
# All the time spent here.
40      1.01      0.02525   1.018     0.02546    ~:0(<method 'get_indexer_non_unique'>)  indexing.py:1443(_has_valid_type)
...
# seems to be called a lot.
1500    0.000582  3.88e-07  0.000832  5.547e-07  ~:0(<isinstance>)
PS. I posted a similar question previously, about why Series.map is so slow: Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing. That doesn't seem to be happening here.
Update:

For similarly sized a and common, where a is unique:

%timeit a.loc[common]
# 1000 loops, best of 3: 760 µs per loop
... as @jpp points out. The duplicated index is likely to blame.
Answer
Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common] # 34.1 ms
a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common] # 6.86 ms
As explained in this related question:
When the index is unique, pandas uses a hashtable to map key to value: O(1). When the index is non-unique and sorted, pandas uses binary search: O(logN). When the index is randomly ordered, pandas needs to check all the keys in the index: O(N).
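You can check ahead of time which of those three regimes an index falls into via its flags (a small sketch with hand-made indexes echoing the question's labels, not the answer's own code):

```python
import pandas as pd

# The three lookup regimes, distinguished by the index's own flags:
unsorted_dups = pd.Index([3, 9, 3, 4, 8])  # duplicated + unordered -> O(N) scan
sorted_dups   = pd.Index([3, 3, 4, 8, 9])  # duplicated + sorted    -> O(logN) search
unique_idx    = pd.Index([3, 4, 8, 9])     # unique                 -> O(1) hashtable

print(unsorted_dups.is_unique, unsorted_dups.is_monotonic_increasing)  # False False
print(sorted_dups.is_unique, sorted_dups.is_monotonic_increasing)      # False True
print(unique_idx.is_unique, unique_idx.is_monotonic_increasing)        # True True
```

So if you must do repeated lookups against a duplicated index, sorting it first at least moves you from the O(N) regime to the O(logN) one.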