Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?

Problem Description

I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files, I quickly run out of available RAM. I have tried all sorts of workarounds (concatenating in smaller batches with for loops, etc.), but I still cannot get them to concatenate. Two questions spring to mind:

  1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append, because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat() (see the toy example after these questions).

  2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?
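
A toy example of the 'column merging' from question 1 (these two small frames are made up for illustration, not my real data): join="outer" keeps the union of the columns and fills the gaps with NaN.

 import pandas as pd

 # two frames whose columns only partly overlap
 df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
 df2 = pd.DataFrame({"b": [5.0, 6.0], "c": ["x", "y"]})

 # join="outer" keeps the union of the columns; cells with no source data
 # are filled with NaN (and "a" is upcast to float to hold the NaN)
 out = pd.concat([df1, df2], join="outer", axis=0, ignore_index=True)
 print(out)
 #      a    b    c
 # 0  1.0  3.0  NaN
 # 1  2.0  4.0  NaN
 # 2  NaN  5.0    x
 # 3  NaN  6.0    y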

I should also note that I do not think the problem is an explosion of columns: concatenating 100 of the dataframes together gives about 3000 columns, whereas the base dataframe has about 1000.
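
One quick way to check that (a diagnostic sketch, assuming the frames live in the datalist4 list used in the example code below):

 # count the distinct column names across all frames; if this stays near
 # the per-frame width (~1000), columns are not exploding
 all_columns = set()
 for df in datalist4:
     all_columns.update(df.columns)
 print(len(all_columns))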

The data I am working with is financial data, about 1000 columns wide and about 50,000 rows deep, for each of my 900 dataframes. The types of data, going across from left to right, are:

  1. date in string format,
  2. string
  3. np.float
  4. int

... and so on, repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 will not be discarded but shunted off to the side.

 # example code
 data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)

 # two example dataframes (about 90% of the column names should be in
 # common between the two dataframes; the unnamed columns, etc. are not
 # a significant number of the columns)

print datalist4[0].head()
                 800_1     800_2   800_3  800_4               900_1     900_2
0  2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID
1  2014-08-06 09:00:00  BEST_ASK  1120.0    103 2014-08-06 09:00:00  BEST_ASK
2  2014-08-06 09:00:00  BEST_BID  1106.9     11 2014-08-06 09:00:00  BEST_BID
3  2014-08-06 09:00:00  BEST_ASK  1125.8     62 2014-08-06 09:00:00  BEST_ASK
4  2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID

    900_3  900_4              1000_1    1000_2    ...     2400_4
0  1017.2    103 2014-08-06 09:00:00  BEST_BID    ...        NaN
1  1020.1    103 2014-08-06 09:00:00  BEST_ASK    ...        NaN
2  1004.3     11 2014-08-06 09:00:00  BEST_BID    ...        NaN
3  1022.9     11 2014-08-06 09:00:00  BEST_ASK    ...        NaN
4  1006.7     10 2014-08-06 09:00:00  BEST_BID    ...        NaN

                      _1  _2  _3  _4                   _1.1 _2.1 _3.1  _4.1
0  #N/A Invalid Security NaN NaN NaN  #N/A Invalid Security  NaN  NaN   NaN
1                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
2                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
3                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
4                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN

      dater  
0  2014.8.6  
1  2014.8.6  
2  2014.8.6  
3  2014.8.6  
4  2014.8.6  

[5 rows x 777 columns]

print datalist4[1].head()
                 150_1     150_2   150_3  150_4               200_1     200_2
0  2013-12-04 09:00:00  BEST_BID  1639.6     30 2013-12-04 09:00:00  BEST_ASK
1  2013-12-04 09:00:00  BEST_ASK  1641.8    133 2013-12-04 09:00:08  BEST_BID
2  2013-12-04 09:00:01  BEST_BID  1639.5     30 2013-12-04 09:00:08  BEST_ASK
3  2013-12-04 09:00:05  BEST_BID  1639.4     30 2013-12-04 09:00:08  BEST_ASK
4  2013-12-04 09:00:08  BEST_BID  1639.3    133 2013-12-04 09:00:08  BEST_BID

    200_3  200_4               250_1     250_2    ...                 2500_1
0  1591.9    133 2013-12-04 09:00:00  BEST_BID    ...    2013-12-04 10:29:41
1  1589.4     30 2013-12-04 09:00:00  BEST_ASK    ...    2013-12-04 11:59:22
2  1591.6    103 2013-12-04 09:00:01  BEST_BID    ...    2013-12-04 11:59:23
3  1591.6    133 2013-12-04 09:00:04  BEST_BID    ...    2013-12-04 11:59:26
4  1589.4    133 2013-12-04 09:00:07  BEST_BID    ...    2013-12-04 11:59:29

     2500_2 2500_3 2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK   0.35     50  #N/A Invalid Security             NaN
1  BEST_ASK   0.35     11                    NaN             NaN
2  BEST_ASK   0.40     11                    NaN             NaN
3  BEST_ASK   0.45     11                    NaN             NaN
4  BEST_ASK   0.50     21                    NaN             NaN

  Unnamed: 844_3 Unnamed: 844_4         Unnamed: 848_1      dater  
0            NaN            NaN  #N/A Invalid Security  2013.12.4  
1            NaN            NaN                    NaN  2013.12.4  
2            NaN            NaN                    NaN  2013.12.4  
3            NaN            NaN                    NaN  2013.12.4  
4            NaN            NaN                    NaN  2013.12.4  

[5 rows x 850 columns]

Recommended Answer

I've had performance issues concatenating a large number of DataFrames onto a 'growing' DataFrame. My workaround was to append all the sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames was complete.
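
A minimal sketch of that pattern (process_chunk and sources are placeholders for whatever produces each sub-DataFrame):

 import pandas as pd

 # anti-pattern: growing a DataFrame inside the loop; every iteration
 # copies all rows accumulated so far, which is quadratic in total size
 # result = pd.DataFrame()
 # for src in sources:
 #     result = pd.concat([result, process_chunk(src)], ignore_index=True)

 # workaround: collect the sub-DataFrames in a list and concatenate once
 frames = []
 for src in sources:                    # placeholder iterable of inputs
     frames.append(process_chunk(src))  # placeholder: builds one sub-DataFrame
 result = pd.concat(frames, join="outer", axis=0, ignore_index=True)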
