Why does concatenation of DataFrames get exponentially slower?
Question
I have a function which processes a DataFrame, largely to process the data into buckets and create a binary matrix of features in a particular column using pd.get_dummies(df[col]).
To avoid processing all of my data using this function at once (which goes out of memory and causes iPython to crash), I have broken the large DataFrame into chunks using:
chunks = (len(df) // 10000) + 1
df_list = np.array_split(df, chunks)
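As a toy illustration (with a made-up DataFrame and a much smaller chunk size than the question's 10000), np.array_split returns a list of smaller DataFrames whose rows add back up to the original:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 25 rows, chunk size 10 for illustration.
df = pd.DataFrame({"col": list("abcde") * 5})

chunks = (len(df) // 10) + 1          # integer division, as in the question
df_list = np.array_split(df, chunks)  # list of smaller DataFrames

print(len(df_list))                   # number of chunks
print(sum(len(c) for c in df_list))   # total rows are preserved
```

Concatenating the pieces back together reproduces the original frame, which is what makes the chunked-processing approach viable.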
pd.get_dummies(df) will automatically create new columns based on the contents of df[col] and these are likely to differ for each df in df_list.
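A minimal sketch (with invented values) of why the dummy columns can differ per chunk, and how pd.concat aligns differing columns by filling the missing entries with NaN:

```python
import pandas as pd

# Two chunks whose 'col' values only partially overlap.
chunk1 = pd.DataFrame({"col": ["a", "b"]})
chunk2 = pd.DataFrame({"col": ["b", "c"]})

d1 = pd.get_dummies(chunk1["col"])  # columns: a, b
d2 = pd.get_dummies(chunk2["col"])  # columns: b, c

# concat takes the union of the columns; entries absent from a chunk become NaN.
combined = pd.concat([d1, d2], axis=0)
print(sorted(combined.columns))
```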
After processing, I am concatenating the DataFrames back together using:
for i, df_chunk in enumerate(df_list):
    print("chunk", i)
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print(datetime.datetime.utcnow())
The processing time of the first chunk is perfectly acceptable; however, it grows with each chunk! This has nothing to do with preprocess_data(df_chunk), as there is no reason for it to increase. Is this increase in time a result of the call to pd.concat()?
See the log below:
chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365
Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!
Open to any other suggestions in Python!
Answer
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames has to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):
super_x = pd.concat([super_x, x], axis=0)
| iteration | size of old super_x | size of x | copying required |
|-----------|---------------------|-----------|------------------|
| 0         | 0                   | 1         | 1                |
| 1         | 1                   | 1         | 2                |
| 2         | 2                   | 1         | 3                |
| ...       | ...                 | ...       | ...              |
| N-1       | N-1                 | 1         | N                |
1 + 2 + 3 + ... + N = N(N+1)/2. So O(N**2) copies are required to complete the loop.
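The arithmetic can be checked directly. This small sketch counts elements copied under each strategy using the copy model above (size-1 chunks; a concat copies every element of its inputs), not pandas itself:

```python
# Number of chunks, as in the question.
N = 2900

# Loop strategy: at iteration i, super_x holds i rows and x holds 1,
# so the concat copies i + 1 elements.
loop_copies = sum(i + 1 for i in range(N))

# List strategy: one concat at the end copies each of the N size-1 frames once.
list_copies = N

print(loop_copies)  # equals N*(N+1)/2
print(list_copies)  # equals N
```

With 2900 chunks the loop strategy copies over four million elements' worth of data, versus 2900 for the single concat.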
Now consider:
super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)
Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
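Putting it together with the questioner's full loop: a runnable sketch that collects both x and y in lists and concatenates each once after the loop. The real preprocess_data is not shown in the question, so a stand-in that dummy-encodes one column is used here.

```python
import numpy as np
import pandas as pd

def preprocess_data(chunk):
    """Stand-in for the question's real preprocessing: dummy-encode 'col'."""
    x = pd.get_dummies(chunk["col"])
    y = chunk[["label"]]
    return x, y

# Invented toy data split into 2 chunks.
df = pd.DataFrame({"col": list("abcab"), "label": [0, 1, 0, 1, 1]})
df_list = np.array_split(df, 2)

x_parts, y_parts = [], []
for df_chunk in df_list:
    x, y = preprocess_data(df_chunk)
    x_parts.append(x)
    y_parts.append(y)

# One concat per output, after the loop: O(N) copying instead of O(N**2).
super_x = pd.concat(x_parts, axis=0)
super_y = pd.concat(y_parts, axis=0)
print(len(super_x), len(super_y))
```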