Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?
Question
I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds to fix this (concatenating in smaller batches with for loops, etc.; a sketch of that approach follows the two questions below), but I still cannot get these to concatenate. Two questions spring to mind:
- Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().
- Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?
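(For context, the batched-concatenation workaround mentioned above, sketched here with a made-up toy list and batch size rather than the real 900 frames, looks roughly like this:)

import pandas as pd

# made-up stand-in for the real list of ~900 frames
datalist4 = [pd.DataFrame({"a": range(3)}) for _ in range(10)]

batch_size = 3  # toy batch size; the real batches would be much larger
batches = []
for i in range(0, len(datalist4), batch_size):
    # concatenate one slice of frames at a time...
    batches.append(pd.concat(datalist4[i:i + batch_size],
                             join="outer", axis=0, ignore_index=True))

# ...then concatenate the intermediate results
data = pd.concat(batches, join="outer", axis=0, ignore_index=True)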
I should also note that I do not think the problem is an explosion of columns, as concatenating 100 of the dataframes together gives about 3000 columns whereas the base dataframe has about 1000.
The data I am working with is financial data, about 1000 columns wide and about 50,000 rows deep, for each of my 900 dataframes. The types of data going across from left to right are:
- date in string format
- string
- np.float
- int
... and so on repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 will not be discarded but shunted off to the side.
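(To make that concrete, here is a minimal toy sketch, with made-up frames and column names, of how join='outer' keeps the union of columns and fills the gaps with NaN:)

import pandas as pd

# two tiny made-up frames; "a" is shared, "b" and "c" are not
df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"a": [3, 4], "c": [0.5, 0.7]})

# join="outer" keeps all columns from both frames; rows coming from df1
# get NaN in "c", rows coming from df2 get NaN in "b"
out = pd.concat([df1, df2], join="outer", axis=0, ignore_index=True)
print(out)
#    a    b    c
# 0  1    x  NaN
# 1  2    y  NaN
# 2  3  NaN  0.5
# 3  4  NaN  0.7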
#example code
data=pd.concat(datalist4, join="outer", axis=0, ignore_index=True)
#two example dataframes (about 90% of the column names should be in common
#between the two dataframes, the unnamed columns, etc are not a significant
#number of the columns)
print datalist4[0].head()
  800_1 800_2 800_3 800_4 900_1 900_2
0 2014-08-06 09:00:00 BEST_BID 1117.1 103 2014-08-06 09:00:00 BEST_BID
1 2014-08-06 09:00:00 BEST_ASK 1120.0 103 2014-08-06 09:00:00 BEST_ASK
2 2014-08-06 09:00:00 BEST_BID 1106.9 11 2014-08-06 09:00:00 BEST_BID
3 2014-08-06 09:00:00 BEST_ASK 1125.8 62 2014-08-06 09:00:00 BEST_ASK
4 2014-08-06 09:00:00 BEST_BID 1117.1 103 2014-08-06 09:00:00 BEST_BID
  900_3 900_4 1000_1 1000_2 ... 2400_4
0 1017.2 103 2014-08-06 09:00:00 BEST_BID ... NaN
1 1020.1 103 2014-08-06 09:00:00 BEST_ASK ... NaN
2 1004.3 11 2014-08-06 09:00:00 BEST_BID ... NaN
3 1022.9 11 2014-08-06 09:00:00 BEST_ASK ... NaN
4 1006.7 10 2014-08-06 09:00:00 BEST_BID ... NaN
  _1 _2 _3 _4 _1.1 _2.1 _3.1 _4.1
0 #N/A Invalid Security NaN NaN NaN #N/A Invalid Security NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
dater
0 2014.8.6
1 2014.8.6
2 2014.8.6
3 2014.8.6
4 2014.8.6
[5 rows x 777 columns]
print datalist4[1].head()
  150_1 150_2 150_3 150_4 200_1 200_2
0 2013-12-04 09:00:00 BEST_BID 1639.6 30 2013-12-04 09:00:00 BEST_ASK
1 2013-12-04 09:00:00 BEST_ASK 1641.8 133 2013-12-04 09:00:08 BEST_BID
2 2013-12-04 09:00:01 BEST_BID 1639.5 30 2013-12-04 09:00:08 BEST_ASK
3 2013-12-04 09:00:05 BEST_BID 1639.4 30 2013-12-04 09:00:08 BEST_ASK
4 2013-12-04 09:00:08 BEST_BID 1639.3 133 2013-12-04 09:00:08 BEST_BID
  200_3 200_4 250_1 250_2 ... 2500_1
0 1591.9 133 2013-12-04 09:00:00 BEST_BID ... 2013-12-04 10:29:41
1 1589.4 30 2013-12-04 09:00:00 BEST_ASK ... 2013-12-04 11:59:22
2 1591.6 103 2013-12-04 09:00:01 BEST_BID ... 2013-12-04 11:59:23
3 1591.6 133 2013-12-04 09:00:04 BEST_BID ... 2013-12-04 11:59:26
4 1589.4 133 2013-12-04 09:00:07 BEST_BID ... 2013-12-04 11:59:29
  2500_2 2500_3 2500_4 Unnamed: 844_1 Unnamed: 844_2
0 BEST_ASK 0.35 50 #N/A Invalid Security NaN
1 BEST_ASK 0.35 11 NaN NaN
2 BEST_ASK 0.40 11 NaN NaN
3 BEST_ASK 0.45 11 NaN NaN
4 BEST_ASK 0.50 21 NaN NaN
Unnamed: 844_3 Unnamed: 844_4 Unnamed: 848_1 dater
0 NaN NaN #N/A Invalid Security 2013.12.4
1 NaN NaN NaN 2013.12.4
2 NaN NaN NaN 2013.12.4
3 NaN NaN NaN 2013.12.4
4 NaN NaN NaN 2013.12.4
[5 rows x 850 columns]
Answer
I've had performance issues concatenating a large number of DataFrames to a 'growing' DataFrame. My workaround was to append all sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames was complete.
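A rough sketch of that pattern, assuming each sub-DataFrame comes from some per-file processing step (the load_chunk helper and file names below are hypothetical):

import pandas as pd

def load_chunk(path):
    # hypothetical stand-in for whatever produces each sub-DataFrame
    return pd.read_csv(path)

paths = ["part_0.csv", "part_1.csv"]  # hypothetical input files

# build every piece first, appending to a plain Python list...
frames = [load_chunk(p) for p in paths]

# ...then concatenate once at the end, instead of growing a DataFrame
# inside the loop, which re-copies all previously accumulated rows on
# every iteration
result = pd.concat(frames, join="outer", axis=0, ignore_index=True)

The key point is that the concatenation happens once over the finished list rather than repeatedly against an ever-larger intermediate DataFrame.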