Pandas: peculiar performance drop for inplace rename after dropna

Question

I have reported this on the pandas issue tracker. In the meantime, I am posting it here in the hope of saving others time if they run into something similar.

While profiling a process that needed to be optimized, I found that renaming columns NOT inplace improves performance (execution time) by a factor of 120. Profiling indicates this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example demonstrates a factor of about 12:

import pandas as pd
import numpy as np

inplace = True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r // 3, replace=False)  # integer division: the sample size must be an int on Python 3
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

First line of the %%prun output:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.018    0.018    0.018    0.018  {gc.collect}

inplace = False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r // 3, replace=False)  # integer division: the sample size must be an int on Python 3
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

Avoiding dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r // 3, replace=False)  # integer division: the sample size must be an int on Python 3
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
#no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 µs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r // 3, replace=False)  # integer division: the sample size must be an int on Python 3
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 µs per loop

Recommended answer

This is a copy of the explanation on github.

There is no guarantee that an inplace operation is actually faster. Often they are actually the same operation that works on a copy, but the top-level reference is reassigned.
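As a small illustration of that point, the sketch below renames the columns of two identical frames, once with inplace=True and once by reassigning the result, and checks that both end up the same. The helper make_frame and the variable names are made up for this example; it says nothing about which variant is faster.

```python
import numpy as np
import pandas as pd


def make_frame(seed=0, rows=7, cols=3):
    """Hypothetical helper: build a small reproducible frame."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame(rng.rand(rows, cols), columns=range(cols))


mapping = {col: 'd{}'.format(col) for col in range(3)}

# Variant 1: the inplace rename changes the object that `a` refers to.
a = make_frame()
a.rename(columns=mapping, inplace=True)

# Variant 2: rename builds a new frame and the name `b` is rebound to it.
b = make_frame()
b = b.rename(columns=mapping)

# Both variants produce the same labels and the same data.
same_result = a.equals(b)
print(list(a.columns), list(b.columns), same_result)
```

From the caller's point of view the two spellings are interchangeable here, which is why the choice between them is mostly about readability and, as in this question, about side effects such as the extra check.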

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).

This check must perform a garbage collection to wipe out some cached references and see whether the object is a copy. Unfortunately, Python syntax makes this unavoidable.
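One way to confirm that a given call triggers a garbage collection, without reading profiler output, is to count collections through the standard library's gc.callbacks hook. The helper below, count_gc_runs, is a hypothetical name used only for this sketch. It is demonstrated on an explicit gc.collect() call; whether a pandas rename triggers a collection depends on the pandas version and settings, so no count is claimed for that case.

```python
import gc


def count_gc_runs(func):
    """Call func() and return how many garbage collections started meanwhile."""
    runs = []

    def on_gc(phase, info):
        # The interpreter invokes this hook at the start and stop of every
        # collection, including explicit gc.collect() calls.
        if phase == "start":
            runs.append(info["generation"])

    gc.callbacks.append(on_gc)
    try:
        func()
    finally:
        # Always unregister the hook so later code is not affected.
        gc.callbacks.remove(on_gc)
    return len(runs)


# Sanity check: an explicit collection is counted.
explicit = count_gc_runs(gc.collect)
print("collections during gc.collect():", explicit)
```

Wrapping the inplace rename from the question in a function and passing it to count_gc_runs would show whether the installed pandas version still collects at that step.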

You can avoid this simply by making a copy first.

df = (df1-df2).dropna().copy()

An inplace operation performed after this copy will be as performant as before.
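To compare the variants outside IPython, the %%timeit cells from the question can be turned into a plain script using the timeit module. This is a sketch under a few assumptions: the function names build, rename_inplace and rename_reassign are invented here, df1 is given its own copy of the index array so that the later edit of t cannot alias it, and the loop count of 50 is arbitrary. As in the original cells, the timings include building the frames, and the numbers will depend on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

MAPPING = {col: 'd{}'.format(col) for col in range(3)}


def build(copy_first):
    """Rebuild the frame from the question; optionally copy after dropna."""
    np.random.seed(0)
    r, c = 7, 3
    t = np.random.rand(r)
    # Assumption: t.copy() keeps df1's index independent of the edit below.
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t.copy())
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    df = (df1 - df2).dropna()
    return df.copy() if copy_first else df


def rename_inplace(copy_first=False):
    df = build(copy_first)
    df.rename(columns=MAPPING, inplace=True)
    return df


def rename_reassign():
    df = build(copy_first=False)
    return df.rename(columns=MAPPING)


variants = [
    ("dropna, inplace rename", rename_inplace),
    ("dropna, copy, inplace rename", lambda: rename_inplace(copy_first=True)),
    ("dropna, reassigned rename", rename_reassign),
]

loops = 50
for label, func in variants:
    seconds = timeit.timeit(func, number=loops)
    print('{:<30s} {:.3f} ms per loop'.format(label, 1000 * seconds / loops))
```

All three variants return the same renamed frame, so any difference in the printed timings comes from how the rename is performed rather than from the result.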

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
