Memory error with large data sets for pandas.concat and numpy.append

Question

I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, computing two 2000 x 800 pandas DataFrames each time). I would like to keep the results in memory in a bigger DataFrame, or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?
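
For scale, a rough back-of-the-envelope estimate of the raw data involved (my own numbers, not measured):

# 50 iterations x 2000 rows x 800 columns of float64 values (8 bytes each) per collection
per_collection = 50 * 2000 * 800 * 8   # 640,000,000 bytes, roughly 610 MiB
both_collections = 2 * per_collection  # roughly 1.2 GiB, before any temporary copies made by pd.concat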

As an example, the following script fails as soon as nbIds is greater than 376:

import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # build two 2000 x nbIds DataFrames of uniform random numbers per iteration
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
# concatenate the 50 pieces into two (50 * 2000) x nbIds DataFrames
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)

The code below fails when nbIds is 665 or higher:

import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
# one growing 1-D numpy array per id, for each of the two collections
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    for i in dataids:
        # np.append returns a new array each time, so every id's array grows
        # by 2000 values per iteration
        dataCollection1[i] = np.append(dataCollection1[i], np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i], np.array(newData2[i]))

I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more. Is there a straightforward way of doing this?
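
To be concrete about the result I am after (based on the first script above; series_i is just an illustrative name):

# after the pd.concat in the first script, column i holds every value generated for id i
series_i = dataCollection1[i]   # pandas Series of length 50 * 2000 = 100000
values_i = series_i.values      # the same data as a numpy array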

I am using 32-bit Python 2.7.5, with pandas 0.12.0 and numpy 1.7.1.

Many thanks for your help!

Answer

As suggested by usethedeathstar, Boud and Jeff in the comments, switching to 64-bit Python does the trick: a 32-bit process can only address around 2-4 GB of memory depending on the platform, regardless of how much RAM is actually free, so the concatenation exhausts the address space even though memory appears to be available.
If losing precision is not an issue, using the float32 data type, as also suggested by Jeff, increases the amount of data that can be handled in a 32-bit environment.
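
As a minimal sketch, one iteration of the first script could build its DataFrame directly in float32 (4 bytes per value instead of 8), which roughly halves the footprint; passing a shape tuple to np.random.uniform replaces the explicit reshape:

# float32 variant of one iteration; pd.concat keeps the dtype as long as all pieces share it
newData1 = pd.DataFrame(
    np.random.uniform(size=(2000, len(dataids))).astype(np.float32))
dataCollection1.append(newData1)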
