Why is len so much more efficient on DataFrame than on the underlying numpy array?


Question


I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array. I don't understand why. Accessing the same information via shape isn't any help either. This is more relevant as I try to get at the number of columns and number of rows. I was always debating which method to use.


I put together the following experiment and it's very clear that I will be using len on the dataframe. But can someone explain why?

from timeit import timeit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plotting code below

ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'],
         ns]))
dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}

for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit('len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit('len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit('df.values.shape', 'from __main__ import df', number=10000)


fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for i, (m, col) in enumerate(results.items()):  # .iteritems() in older pandas
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)

Answer


From looking at the various methods, the main reason is that constructing the numpy array df.values takes the lion's share of the time.


len(df) and df.shape are fast because they are essentially

len(df.index._data)                            # len(df)

(len(df.index._data), len(df.columns._data))   # df.shape


where _data is a numpy.ndarray. Thus, df.shape should be about half as fast as len(df), because it finds the length of both df.index and df.columns (both of type pd.Index).
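As a quick sanity check, here is a minimal sketch of my own (assuming a reasonably recent pandas) showing that len(df) only consults the row index and never touches the data blocks:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 100)))

# DataFrame.__len__ simply returns len(self.index)
assert len(df) == len(df.index)

# df.shape consults both axis indexes, roughly twice the work
assert df.shape == (len(df.index), len(df.columns))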


Let's say you had already extracted vals = df.values. Then

In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))

In [2]: vals = df.values

In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop

In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop

Compare this with:

In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop
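
The practical takeaway for day-to-day code (my own note, not part of the original answer): if you need the array more than once, extract it a single time so the construction cost is paid only once:

vals = df.values       # the expensive step: builds (or views) the ndarray
n_rows = len(vals)     # nanoseconds from here on, like len on any ndarray
n_rows, n_cols = vals.shape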


So the bottleneck is not len but how df.values is constructed. If you examine the source behind pandas.DataFrame.values (a property in the real source), you'll find these (roughly equivalent) methods:

def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T

    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)

    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data

    if self._data._is_single_block or not self._data.is_mixed_type:
        return mgr.blocks[0].get_values()
    else:
        # mixed dtypes: interleave every block into one freshly allocated
        # array (this branch effectively lives on the BlockManager,
        # i.e. self._data, in the real source)
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result

        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1

        # vvv here is your final array assuming you actually have data
        return result 

def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data

        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        # consolidation changed the block layout, so the item cache is
        # stale (the real source does this via self._clear_item_cache())
        self._item_cache.clear()
    return result
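
To see this construction cost in action, here is a small illustration of my own (not from the original answer): a homogeneous, single-block frame can hand its block back more or less directly, while a mixed-dtype frame must allocate and fill a brand-new interleaved array:

from timeit import timeit

import numpy as np
import pandas as pd

homogeneous = pd.DataFrame(np.zeros((10_000, 10)))    # one float64 block
mixed = homogeneous.copy()
mixed['label'] = 'x'                                  # adds an object block

print(timeit(lambda: homogeneous.values, number=1000))  # cheap: single block
print(timeit(lambda: mixed.values, number=1000))        # interleaves into an object array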

Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
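
A quick way to peek at those internals (illustrative only; the attribute is private and was renamed to _mgr in pandas 1.0+):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0]})
mgr = df._data          # use df._mgr on newer pandas versions
print(type(mgr))        # a BlockManager, not an ndarray
print(mgr.blocks)       # one int64 block and one float64 block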
