Memory error with large data sets for pandas.concat and numpy.append

Question

I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, computing two 2000 x 800 pandas DataFrames each time). I would like to keep the results in memory in a bigger DataFrame, or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?
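
For scale, a rough back-of-the-envelope estimate of the raw data involved (my own numbers, not measured):

# 50 iterations x 2000 rows x 800 columns of float64 values (8 bytes each) per collection
per_collection = 50 * 2000 * 800 * 8   # 640,000,000 bytes, roughly 610 MiB
both_collections = 2 * per_collection  # roughly 1.2 GiB, before any temporary copies made by pd.concat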

As an example, the following script fails as soon as nbIds is greater than 376:

import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # build two 2000 x nbIds DataFrames of uniform random numbers per iteration
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
# concatenate the 50 pieces into two (50 * 2000) x nbIds DataFrames
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)

The code below fails when nbIds is 665 or higher:

import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
# one growing 1-D numpy array per id, for each of the two collections
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    for i in dataids:
        # np.append returns a new array each time, so every id's array grows
        # by 2000 values per iteration
        dataCollection1[i] = np.append(dataCollection1[i], np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i], np.array(newData2[i]))

I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more. Is there a straightforward way of doing this?
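
To be concrete about the result I am after (based on the first script above; series_i is just an illustrative name):

# after the pd.concat in the first script, column i holds every value generated for id i
series_i = dataCollection1[i]   # pandas Series of length 50 * 2000 = 100000
values_i = series_i.values      # the same data as a numpy array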

I am using 32-bit Python 2.7.5, with pandas 0.12.0 and numpy 1.7.1.

Many thanks for your help!

Answer

As suggested by usethedeathstar, Boud and Jeff in the comments, switching to 64-bit Python does the trick: a 32-bit process can only address around 2-4 GB of memory depending on the platform, regardless of how much RAM is actually free, so the concatenation exhausts the address space even though memory appears to be available.
If losing precision is not an issue, using the float32 data type, as also suggested by Jeff, increases the amount of data that can be handled in a 32-bit environment.
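
As a minimal sketch, one iteration of the first script could build its DataFrame directly in float32 (4 bytes per value instead of 8), which roughly halves the footprint; passing a shape tuple to np.random.uniform replaces the explicit reshape:

# float32 variant of one iteration; pd.concat keeps the dtype as long as all pieces share it
newData1 = pd.DataFrame(
    np.random.uniform(size=(2000, len(dataids))).astype(np.float32))
dataCollection1.append(newData1)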
