Why does concatenation of DataFrames get exponentially slower?

Problem Description

I have a function which processes a DataFrame, largely to process data into buckets and create a binary matrix of features in a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data with this function at once (which runs out of memory and causes iPython to crash), I have broken the large DataFrame into chunks using:

import numpy as np                    # df is an existing pandas DataFrame
chunks = (len(df) // 10000) + 1       # number of roughly 10,000-row pieces (integer division)
df_list = np.array_split(df, chunks)  # split into a list of smaller DataFrames

pd.get_dummies(df) will automatically create new columns based on the contents of df[col], and these are likely to differ for each df in df_list.
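As a minimal sketch (with made-up column values, not the asker's data) of why the per-chunk dummy columns can differ, and of how pd.concat aligns mismatched columns by filling the gaps with NaN:

import pandas as pd

chunk_a = pd.DataFrame({"col": ["red", "blue"]})
chunk_b = pd.DataFrame({"col": ["blue", "green"]})

dummies_a = pd.get_dummies(chunk_a["col"])     # columns: blue, red
dummies_b = pd.get_dummies(chunk_b["col"])     # columns: blue, green

# pd.concat aligns the differing columns by name and fills missing entries with NaN.
combined = pd.concat([dummies_a, dummies_b], axis=0)
print(sorted(combined.columns))                # ['blue', 'green', 'red']

If the final feature matrix should contain 0s rather than NaN for the categories missing from a chunk, calling fillna(0) on the combined result is one option.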

After processing, I concatenate the DataFrames back together using:

# super_x and super_y start out as empty DataFrames and are rebuilt on every pass
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)   # re-copies everything accumulated so far
    super_y = pd.concat([super_y, y], axis=0)
    print datetime.datetime.utcnow()            # timestamp used for the log below

The processing time of the first chunk is perfectly acceptable; however, it grows with each chunk! This has nothing to do with preprocess_data(df_chunk), as there is no reason for its cost to increase. Is this increase in time occurring as a result of the call to pd.concat()?

Please see the log below:

chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693 
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!

Open to any other suggestions in Python!

Recommended Answer

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and the data from the old DataFrames has to be copied into it. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                     |           |                  |
|       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2, so O(N**2) copies are required to complete the loop.
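As a quick, self-contained illustration of the effect on synthetic data (the chunk sizes here are arbitrary and the timings machine-dependent, so this is only a sketch, not the asker's workload; the second timing uses the single pd.concat recommended below):

import time
import numpy as np
import pandas as pd

chunks = [pd.DataFrame(np.random.rand(10000, 10)) for _ in range(100)]

start = time.perf_counter()
acc = pd.DataFrame()
for c in chunks:
    acc = pd.concat([acc, c], axis=0)      # re-copies len(acc) + len(c) rows every pass
print("concat inside the loop :", time.perf_counter() - start)

start = time.perf_counter()
acc = pd.concat(chunks, axis=0)            # one concat: each row is copied once
print("single concat afterwards:", time.perf_counter() - start)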

Now consider:

super_x = []                              # collect the chunks in a plain Python list
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)                     # O(1) append, no copying
super_x = pd.concat(super_x, axis=0)      # one concat after the loop

Appending to a list is an O(1) operation and requires no copying. There is then a single call to pd.concat after the loop is done. This call to pd.concat makes N copies, since super_x contains N DataFrames of size 1. So, constructed this way, super_x requires only O(N) copies.
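For completeness, a sketch of the same fix applied to the asker's original loop; preprocess_data is the asker's own function and is assumed, as in the question, to return a pair of DataFrames:

x_parts, y_parts = [], []
for i, df_chunk in enumerate(df_list):
    x, y = preprocess_data(df_chunk)
    x_parts.append(x)                     # O(1) list appends inside the loop
    y_parts.append(y)

super_x = pd.concat(x_parts, axis=0)      # each row is copied once, after the loop
super_y = pd.concat(y_parts, axis=0)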
