Comparison of Pandas lookup times


Question


After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.

Here is the setup...

import pandas as pd
import numpy as np
import itertools

letters = [chr(x) for x in range(ord('a'), ord('z'))]  # note: range end is exclusive, so 'z' is omitted
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]

df1 = pd.DataFrame({
        'value': np.random.normal(size=(1000000)), 
        'letter': np.random.choice(letter_combinations, 1000000)
    })
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()


So df1 looks something like this...

print(df1.head(5))


>>>
  letter     value
0    bdh  0.253778
1    cem -1.915726
2    mru -0.434007
3    lnw -1.286693
4    fjv  0.245523


Here is the code to test differences in lookup performance...

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']

print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']

print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']

The results...

~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached 
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop

Questions...


  1. It's pretty clear why the lookup on the sorted index is so much faster: binary search gives O(log(n)) performance vs. O(n) for a full array scan. But why is the lookup on the sorted non-indexed df2 column SLOWER than the lookup on the unsorted non-indexed column of df1?


  2. What is up with the message The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached? Surely the results aren't being cached. Is it because the index is created lazily and isn't actually built until needed? That would explain why the warning appears only on the first call to .loc[].


  3. Why isn't an index sorted by default? Is the fixed cost of the sort too high?

Answer


The disparity in these %timeit results

In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop

In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop


also shows up in the pure NumPy equality comparisons:

In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop

In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop


Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython function which loops through the values of the underlying NumPy array, df1['letter'].values. It is essentially doing the same thing as df1['letter'].values == 'ben' but with different handling of NaNs.


Moreover, notice that simply accessing the items in df1['letter'] in sequential order can be done more quickly than doing the same for df2['letter']:

In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop

In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop


The difference in times within each of these three sets of %timeit tests is roughly the same. I think that is because they all share the same cause.


Since the letter column holds strings, the NumPy arrays df1['letter'].values and df2['letter'].values have dtype object, and therefore they hold pointers to the memory locations of arbitrary Python objects (in this case strings).
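This is easy to verify directly; a minimal sketch with made-up data (not the question's setup):

```python
import numpy as np
import pandas as pd

s = pd.Series(['bdh', 'cem', 'mru'])

# the backing NumPy array has dtype object: each slot is a pointer
# to a Python str object, not inline character data
arr = s.values
assert arr.dtype == np.dtype(object)
assert all(isinstance(x, str) for x in arr)
```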


Consider the memory locations of the strings stored in the DataFrames df1 and df2. In CPython, id() returns the memory address of the object:

memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter'])), })

               df1              df2
0  140226328244040  140226299303840
1  140226328243088  140226308389048
2  140226328243872  140226317328936
3  140226328243760  140226230086600
4  140226328243368  140226285885624


The strings in df1 (after the first dozen or so) tend to appear sequentially in memory, while sorting causes the strings in df2 (taken in order) to be scattered in memory:

In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]: 
         df1         df2
0        NaN         NaN
1     -952.0   9085208.0
2      784.0   8939888.0
3     -112.0 -87242336.0
4     -392.0  55799024.0
5     -392.0   5436736.0
6      952.0  22687184.0
7       56.0 -26436984.0
8     -448.0  24264592.0
9      -56.0  -4092072.0
10    -168.0 -10421232.0
11 -363584.0   5512088.0
12      56.0 -17433416.0
13      56.0  40042552.0
14      56.0 -18859440.0
15      56.0 -76535224.0
16      56.0  94092360.0
17      56.0  -4189368.0
18      56.0     73840.0
19      56.0  -5807616.0
20      56.0  -9211680.0
21      56.0  20571736.0
22      56.0 -27142288.0
23      56.0   5615112.0
24      56.0  -5616568.0
25      56.0   5743152.0
26      56.0 -73057432.0
27      56.0  -4988200.0
28      56.0  85630584.0
29      56.0  -4706136.0


Most of the strings in df1 are 56 bytes apart:

In [16]: diffs['df1'].value_counts()
Out[16]: 
 56.0           986109
 120.0           13671
-524168.0          215
-56.0                1
-12664712.0          1
 41136.0             1
-231731080.0         1
Name: df1, dtype: int64

In [20]: len(diffs['df1'].value_counts())
Out[20]: 7


In contrast, the strings in df2 are scattered all over the place:

In [17]: diffs['df2'].value_counts().head()
Out[17]: 
-56.0     46
 56.0     44
 168.0    39
-112.0    37
-392.0    35
Name: df2, dtype: int64

In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764


When these objects (strings) are located sequentially in memory, their values can be retrieved more quickly. This is why the equality comparisons performed by df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values == 'ben'. The lookup time is smaller.
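A key point here is that sorting an object array only permutes the pointers; the string objects themselves stay where they were allocated. A minimal sketch of this (illustrative data, not the question's setup):

```python
import numpy as np

# build an object array of distinct strings in shuffled order
strings = ['%03d' % i for i in range(1000)]
np.random.shuffle(strings)
arr = np.array(strings, dtype=object)

before = {id(s) for s in arr}
arr_sorted = np.sort(arr)

# sorting reorders the pointers, but the very same string objects
# remain at their original memory locations
assert {id(s) for s in arr_sorted} == before
assert list(arr_sorted) == sorted(strings)
```

So a column that was allocated sequentially and then sorted ends up traversing memory in a scattered order, which is exactly what the id() diffs above show.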


This memory accessing issue also explains why there is no disparity in the %timeit results for the value column.

In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop

In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop


df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object arrays, their values are packed together contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter') causes the values in df2['value'] to be reordered, but since the values are copied into a new NumPy array, the values are located sequentially in memory. So accessing the values in df2['value'] can be done just as quickly as those in df1['value'].
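This can be checked as well; a minimal sketch with a tiny made-up frame (not the question's setup):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'letter': list('cab'), 'value': [3.0, 1.0, 2.0]})
df2 = df1.sort_values('letter')

# the float64 column is copied into a fresh, contiguous buffer by the
# sort, so sequential access stays fast despite the reordering
vals = np.asarray(df2['value'])
assert vals.flags['C_CONTIGUOUS']
assert vals.dtype == np.float64
```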
