Pandas: peculiar performance drop for inplace rename after dropna


Problem description

I have reported this on the pandas issue tracker. In the meantime, I am posting it here hoping to save others time in case they run into something similar.

While profiling a process that needed to be optimized, I found that renaming columns NOT inplace improves performance (execution time) by a factor of 120. Profiling indicates this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example demonstrates a factor of 12:

import pandas as pd
import numpy as np

inplace=True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the sample size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

First output line of %%prun:

ncalls tottime percall cumtime percall filename:lineno(function)

1  0.018 0.018 0.018 0.018 {gc.collect}
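Outside IPython, the same kind of profile can be collected with the standard-library cProfile module. The following is a minimal sketch, not the asker's exact session: the helper names make_diff and inplace_rename_workload are hypothetical, the index array is copied explicitly so the two frames are guaranteed to differ in a few labels, and whether gc.collect shows up at all is assumed to depend on the installed pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def make_diff(r=7, c=3, seed=0):
    """Build a (df1 - df2).dropna() frame similar to the one in the question."""
    rng = np.random.RandomState(seed)
    t1 = rng.rand(r)
    t2 = t1.copy()  # explicit copy, so df1's index is never modified
    indx = rng.choice(r, r // 3, replace=False)
    t2[indx] = rng.rand(len(indx))  # a few index labels now differ
    df1 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t1)
    df2 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t2)
    return (df1 - df2).dropna()


def inplace_rename_workload():
    """The slow path from the question: dropna followed by an inplace rename."""
    df = make_diff()
    df.rename(columns={col: "d{}".format(col) for col in df.columns}, inplace=True)
    return df


profiler = cProfile.Profile()
df = profiler.runcall(inplace_rename_workload)

# Collect the five most expensive entries by total time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

If a garbage-collection call dominates the top of the report, the setup matches the one described here; if not, the installed pandas version probably handles the check differently.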

inplace=False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the sample size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the sample size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
#no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 µs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the sample size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 µs per loop
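The four measurements above can also be reproduced in a plain script with the standard-library timeit module. This is only a sketch under stated assumptions: run_case is a hypothetical helper, the repeat counts are arbitrary, and the absolute numbers (and possibly the gap itself) will differ across machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def run_case(use_dropna, inplace, r=7, c=3):
    """Build the difference frame and rename its columns in the requested way."""
    rng = np.random.RandomState(0)
    t1 = rng.rand(r)
    t2 = t1.copy()  # explicit copy, so the two indexes really differ
    indx = rng.choice(r, r // 3, replace=False)
    t2[indx] = rng.rand(len(indx))
    df1 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t1)
    df2 = pd.DataFrame(rng.rand(r, c), columns=range(c), index=t2)

    df = df1 - df2
    if use_dropna:
        df = df.dropna()

    mapping = {col: "d{}".format(col) for col in df.columns}
    if inplace:
        df.rename(columns=mapping, inplace=True)
    else:
        df = df.rename(columns=mapping)
    return df


number = 20  # calls per timing run
timings = {}
for use_dropna in (True, False):
    for inplace in (True, False):
        runs = timeit.repeat(lambda: run_case(use_dropna, inplace),
                             number=number, repeat=3)
        timings[(use_dropna, inplace)] = min(runs) / number  # best time per call
        print("dropna={!s:5} inplace={!s:5} {:.1f} us per call".format(
            use_dropna, inplace, timings[(use_dropna, inplace)] * 1e6))
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %%timeit reports.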

Solution

This is a copy of the explanation on GitHub.

There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference reassigned afterwards.

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you apply a new operation, this triggers a SettingWithCopy check because it could be a copy (but often is not).

This check must perform a garbage collection to clear some cached references and determine whether the object is a copy. Unfortunately, Python syntax makes this unavoidable.

You can avoid this by simply making a copy first:

df = (df1-df2).dropna().copy()

An inplace operation applied after this copy will be as performant as before.
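As a small illustration of the workaround, the sketch below applies the explicit copy before an inplace rename and compares the result with plain reassignment. The frames, index labels, and variable names are made up for the example; it only shows that both routes give the same frame, not how fast either one is.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Two small frames whose indexes share only the labels 0.1 and 0.2.
df1 = pd.DataFrame(rng.rand(4, 2), columns=range(2), index=[0.1, 0.2, 0.3, 0.4])
df2 = pd.DataFrame(rng.rand(4, 2), columns=range(2), index=[0.1, 0.2, 0.5, 0.6])

mapping = {0: "d0", 1: "d1"}

# Workaround from the answer: take an explicit copy before the inplace call.
df_inplace = (df1 - df2).dropna().copy()
df_inplace.rename(columns=mapping, inplace=True)

# Alternative without inplace: rename returns a new frame that is reassigned.
df_assigned = (df1 - df2).dropna().rename(columns=mapping)

print(df_inplace.equals(df_assigned))
```

Since both routes end with the same data, the choice between them comes down to readability and the timing behaviour discussed above.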

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
