Why is len so much more efficient on DataFrame than on the underlying numpy array?
Question
I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array. I don't understand why. Accessing the same information via shape isn't any help either. This becomes more relevant as I try to get at the number of columns and number of rows, and I was always debating which method to use.
I put together the following experiment, and it's very clear that I will be using len on the dataframe. But can someone explain why?
from timeit import timeit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'], ns]))

dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}

for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit('len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit('len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit('df.values.shape', 'from __main__ import df', number=10000)

fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for i, (m, col) in enumerate(results.iteritems()):
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)
Answer
From looking at the various methods, the main reason is that constructing the numpy array df.values takes the lion's share of the time.
Both len(df) and df.shape are fast because they are essentially

    len(df.index._data)

and

    (len(df.index._data), len(df.columns._data))

where _data is a numpy.ndarray. Thus, using df.shape should be about half as fast as len(df), because it finds the lengths of both df.index and df.columns (both of type pd.Index).
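To see the same point without any private attributes, here is a minimal sketch (using only public API): the axis Index objects already know their own lengths, so len(df) and df.shape never have to build an array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 10)))

# len(df) and df.shape only consult the axis Index objects,
# which already know their own lengths -- no values array is built.
print(len(df))                            # 1000
print(df.shape)                           # (1000, 10)
print((len(df.index), len(df.columns)))   # (1000, 10) -- the same information
```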
Let's say you had already extracted vals = df.values. Then:
In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))
In [2]: vals = df.values
In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop
In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop
Compare that to:
In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop
So the bottleneck is not len, but how df.values is constructed. If you examine pandas.DataFrame.values(), you'll find the (roughly equivalent) methods:
def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T

    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)

    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data

    if self._data._is_single_block or not self._data.is_mixed_type:
        return mgr.blocks[0].get_values()
    else:
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result

        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1

        # vvv here is your final array, assuming you actually have data
        return result
def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data
        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        self._clear_item_cache()
    return result

def _clear_item_cache(self, i=None):
    if i is not None:
        self._item_cache.pop(i, None)
    else:
        self._item_cache.clear()
Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
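A quick way to see the block machinery at work is to compare a homogeneous frame with a mixed-dtype one. Only in the mixed case must .values interleave the separate blocks into a freshly allocated common-dtype array (a sketch using only public API; the internal block layout itself is a pandas implementation detail):

```python
import numpy as np
import pandas as pd

homog = pd.DataFrame(np.zeros((4, 3)))                # a single float64 block
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # int block + object block

# A single-block frame can hand back its block's array directly...
print(homog.values.dtype)   # float64
# ...but mixed dtypes force an upcast-and-copy into one common array.
print(mixed.values.dtype)   # object
```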