pandas performance: column selection


Problem description

Today I observed that selecting two or more columns of a DataFrame can be much slower than selecting only one.

If I use loc or iloc to choose more than one column and pass the column names or indexes as a list, performance drops about 100-fold compared with selecting a single column, or with selecting several columns through an iloc slice (no list passed).

Example:

df = pd.DataFrame(np.random.randn(10**7,10), columns=list('abcdefghij'))

Single-column selection:

%%timeit -n 100
df['b']
3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.iloc[:,1]
66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.loc[:,'b']
44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Two-column selection:

%%timeit -n 10
df[['b', 'c']]
96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.loc[:,['b', 'c']]
99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.iloc[:,[1,2]]
97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Only this selection performs as expected:

%%timeit -n 100
df.iloc[:,1:3]
103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
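The timings above rely on IPython's %%timeit magic. As a minimal sketch for reproducing the comparison in a plain Python script, the standard-library timeit module can be used instead. The smaller row count (10**5 rather than 10**7) and the repeat settings are my own assumptions to keep the run short, so the absolute numbers will differ from those quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: 10**5 rows instead of the 10**7 used above, so the script runs quickly.
n_rows = 10**5
df = pd.DataFrame(np.random.randn(n_rows, 10), columns=list("abcdefghij"))

# Each entry mirrors one of the selections timed above.
statements = {
    "df['b']": lambda: df["b"],
    "df.iloc[:, 1]": lambda: df.iloc[:, 1],
    "df.loc[:, 'b']": lambda: df.loc[:, "b"],
    "df[['b', 'c']]": lambda: df[["b", "c"]],
    "df.loc[:, ['b', 'c']]": lambda: df.loc[:, ["b", "c"]],
    "df.iloc[:, [1, 2]]": lambda: df.iloc[:, [1, 2]],
    "df.iloc[:, 1:3]": lambda: df.iloc[:, 1:3],
}

timings = {}
for label, func in statements.items():
    # Best of 5 repeats with 20 calls each, reported as seconds per call.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    timings[label] = best
    print(f"{label:<24} {best * 1e6:10.2f} µs per call")
```

Taking the minimum over the repeats reduces noise from other processes. Raising n_rows back to 10**7 should bring the list-based selections closer to the millisecond range reported above.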

What differs in the underlying mechanisms, and why is the difference so large?

As @run-out pointed out, pd.Series seems to be processed much faster than pd.DataFrame. Does anyone know why that is the case?

On the other hand, that does not explain the difference between df.iloc[:,[1,2]] and df.iloc[:,1:3].

Recommended answer

Pandas handles a single row or column as a pandas.Series, which is faster than working within the DataFrame architecture.

Pandas works with a pandas.Series when you ask for:

%%timeit -n 10
df['b']
2.31 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, I can get a DataFrame for the same column by putting the column name in a list. Then you get:

%%timeit -n 10
df[['b']]
90.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

You can see from the above that it is the Series that outperforms the DataFrame.

Here is how pandas treats column 'b' in each case:

type(df['b'])
pandas.core.series.Series

type(df[['b']])
pandas.core.frame.DataFrame
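As a small self-contained illustration of the same point, the sketch below shows that a scalar key yields a one-dimensional Series while a one-element list yields a two-dimensional DataFrame holding the same values. The frame size and the variable names single and wrapped are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=list("abc"))

single = df["b"]      # scalar key -> one-dimensional Series
wrapped = df[["b"]]   # list key   -> two-dimensional DataFrame with one column

print(type(single).__name__, single.shape)    # Series (1000,)
print(type(wrapped).__name__, wrapped.shape)  # DataFrame (1000, 1)

# Both hold the same values; only the container differs.
same_values = np.array_equal(single.to_numpy(), wrapped["b"].to_numpy())
print(same_values)  # True
```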

I'm expanding on my answer because the OP wants to dig deeper into why pd.Series is so much faster than pd.DataFrame, and because this is a great question for expanding my/our understanding of how the underlying technology works. Those with more expertise, please chime in.

First, let's start with numpy, as it is a building block of pandas. Wes McKinney, the author of pandas, writes in Python for Data Analysis about the performance gain of numpy over plain Python:

This is based partly on performance differences having to do with the
cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g.,
summing the rows of a C order array) will generally be the fastest because the
memory subsystem will buffer the appropriate blocks of memory into the ultrafast L1 or
L2 CPU cache.
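To make "contiguous blocks of memory" concrete, here is a minimal NumPy sketch of my own (not taken from the book) that inspects the memory layout of a row and a column of a C order array. The array shape is arbitrary.

```python
import numpy as np

a = np.random.randn(1000, 10)   # NumPy arrays are C order (row-major) by default

row = a[0, :]   # one row: its elements sit next to each other in memory
col = a[:, 1]   # one column: consecutive elements are a full row apart

print(a.flags["C_CONTIGUOUS"])                  # True
print(row.flags["C_CONTIGUOUS"], row.strides)   # True (8,)
print(col.flags["C_CONTIGUOUS"], col.strides)   # False (80,)
```

The strides are the number of bytes between consecutive elements: 8 bytes for the float64 row, but 80 bytes (ten float64 values) for the column, which is why column-wise access to a C order array touches memory less densely.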

Let's see the speed difference in our example by making a numpy array from column 'b' of the dataframe.

a = np.array(df['b'])

Now the performance test:

%%timeit -n 10
a

The result:

32.5 ns ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's a serious gain in performance over the pd.Series time of 2.31 µs, although this measurement only evaluates the name a, whereas df['b'] performs a column lookup on every call.

The other main reason for the performance gain is that numpy indexing goes straight into the NumPy C extensions, whereas indexing into a Series runs a lot of Python code, which is much slower. (read this article)

Now let's look at the question of why:

df.iloc[:,1:3]

significantly outperforms:

df.iloc[:,[1,2]]

It's interesting to note that .loc shows the same performance behavior as .iloc in this scenario.

Our first big clue that something is different is in the following code:

df.iloc[:,1:3] is df.iloc[:,[1,2]]
False

These give the same result, but are different objects. I did a deep dive to try to find out what the difference is, but I was unable to find a reference to this on the internet or in my library of books.
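A caveat on the check above, as a minimal sketch: the is operator tests object identity, and every indexing call builds a new DataFrame, so is returns False even when the same expression is evaluated twice. DataFrame.equals compares the contents instead. The frame size and variable names below are my own choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10), columns=list("abcdefghij"))

by_slice = df.iloc[:, 1:3]
by_list = df.iloc[:, [1, 2]]

# `is` tests object identity, so even repeating the same expression gives False.
same_object_twice = df.iloc[:, 1:3] is df.iloc[:, 1:3]

# DataFrame.equals compares labels, dtypes and values instead.
same_contents = by_slice.equals(by_list)

print(same_object_twice, same_contents)  # False True
```

So the False result confirms only that two separate objects were created. It does not by itself distinguish the slice path from the list path.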

Looking at the source code, we can start to see some differences. I am referring to indexing.py.

In the class _iLocIndexer we can see that pandas does some extra work when a list is passed to iloc instead of a slice.

Right away, we run into these two different branches when the input is checked:

if isinstance(key, slice):
    return

vs.

elif is_list_like_indexer(key):
    # check that the key does not exceed the maximum size of the index
    arr = np.array(key)
    l = len(self.obj._get_axis(axis))

    if len(arr) and (arr.max() >= l or arr.min() < -l):
        raise IndexError("positional indexers are out-of-bounds")
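To see what this check amounts to in isolation, here is a minimal standalone sketch. The function check_positional_key and its axis_length parameter are hypothetical names of mine, not part of pandas. The body simply mirrors the two branches quoted above.

```python
import numpy as np


def check_positional_key(key, axis_length):
    """Hypothetical helper mirroring the bounds check quoted above."""
    if isinstance(key, slice):
        # Slices are accepted without inspecting their bounds.
        return
    arr = np.asarray(key)
    # Valid positions run from -axis_length to axis_length - 1.
    if len(arr) and (arr.max() >= axis_length or arr.min() < -axis_length):
        raise IndexError("positional indexers are out-of-bounds")


check_positional_key(slice(1, 3), 10)   # slice: returns immediately
check_positional_key([1, 2], 10)        # list within bounds: passes

try:
    check_positional_key([1, 10], 10)   # position 10 does not exist on an axis of length 10
except IndexError as exc:
    print(exc)
```

Note that the work done here scales with the length of the key (two elements in the question), not with the number of rows in the frame, which suggests the check is unlikely to account for a difference of tens of milliseconds on its own.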

Could this alone be enough to cause the reduced performance? I don't know.

Although .loc is slightly different, it also suffers a performance penalty when given a list of values. In indexing.py, look at def _getitem_axis(self, key, axis=None) in class _LocIndexer(_LocationIndexer).

The code section under is_list_like_indexer(key) that handles list inputs is quite long and carries a lot of overhead. It contains the note:

# convert various list-like indexers
# to a list of keys
# we will use the *values* of the object
# and NOT the index if its a PandasObject

Certainly there is enough additional overhead in dealing with a list of values or integers, compared with direct slices, to cause delays in processing.
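One further factor that can be checked directly, and which the excerpts above do not show, is whether the selected data is copied. In NumPy, basic slicing returns a view of the original buffer, while indexing with a list of integers returns a copy. The sketch below tests this with np.shares_memory and then applies the same test to a DataFrame. That the pandas results follow the NumPy pattern is an assumption about the installed pandas version, so the pandas lines are printed without an expected value.

```python
import numpy as np
import pandas as pd

# Plain NumPy: basic slicing returns a view, integer-list indexing returns a copy.
a = np.random.randn(1000, 10)
numpy_slice_is_view = np.shares_memory(a, a[:, 1:3])
numpy_list_is_view = np.shares_memory(a, a[:, [1, 2]])
print(numpy_slice_is_view, numpy_list_is_view)  # True False

# pandas: the same test on a single-dtype DataFrame.
df = pd.DataFrame(np.random.randn(1000, 10), columns=list("abcdefghij"))
base = df.to_numpy()
pandas_slice_shares = np.shares_memory(base, df.iloc[:, 1:3].to_numpy())
pandas_list_shares = np.shares_memory(base, df.iloc[:, [1, 2]].to_numpy())
print(pandas_slice_shares, pandas_list_shares)
```

If the list-based selection reports no shared memory, every such call has to copy the selected columns. At 10**7 rows, two float64 columns come to 160 MB, which would be consistent with timings in the millisecond range, whereas a view costs the same regardless of the number of rows.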

The rest of the code is above my pay grade. If anyone can have a look and chime in, it would be most welcome.
