Python MemoryError: cannot allocate array memory


Problem Description

I've got a 250 MB CSV file I need to read, with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255).

I started with a simple np.loadtxt("data/training_nohead.csv",delimiter=",") but this gave me a memory error. I thought this was strange since I'm running 64-bit Python with 8 gigs of memory installed and it died after using only around 512 MB.

I've since tried SEVERAL other tactics, including:


  1. import fileinput and read one line at a time, appending them to an array
  2. np.fromstring after reading in the entire file
  3. np.genfromtxt
  4. Manual parsing of the file (since all data is integers, this was fairly easy to code)

Every method gave me the same result: MemoryError around 512 MB. Wondering if there was something special about 512 MB, I created a simple test program which filled up memory until Python crashed:

str = " " * 511000000 # Start at 511 MB
while 1:
    str = str + " " * 1000 # Add 1 KB at a time

Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000 (fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...

I googled around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)

I copied the code from the answer exactly:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:

MemoryError: cannot allocate array memory

Googling this error, I found only one, not-so-helpful post: Memory error (MemoryError) when creating a boolean NumPy array (Python)

As I'm running Python 2.7, this was not my issue. Any help would be appreciated.

Recommended Answer

With some help from @J.F. Sebastian I developed the following answer:

import numpy as np

train = np.empty([7049, 9246])
row = 0
for line in open("data/training_nohead.csv"):
    train[row] = np.fromstring(line, sep=",")
    row += 1


Of course, this answer assumes prior knowledge of the number of rows and columns. Should you not have this information beforehand, the number of rows will always take a while to calculate, as you have to read the entire file and count the \n characters. Something like this will suffice:

num_rows = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1


As for the number of columns: if every row has the same number of columns, you can just count the first row; otherwise you need to keep track of the maximum.

num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
    tmp = line.split(",")
    if len(tmp) > max_cols:
        max_cols = len(tmp)
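As an aside (this variant is my own sketch, not part of the original answer): once the row and column counts are known, the generator-based iter_loadtxt approach from the question can also be told the final size up front via np.fromiter's count argument, so NumPy allocates the output in a single block instead of growing it as items arrive:

```python
import numpy as np

def iter_loadtxt_counted(filename, num_rows, num_cols, delimiter=",", dtype=float):
    """Generator-based loader that preallocates via np.fromiter's count."""
    def iter_func():
        with open(filename) as infile:
            for line in infile:
                for item in line.rstrip().split(delimiter):
                    yield dtype(item)

    # count= gives np.fromiter the exact element count, so the flat
    # array is allocated once rather than resized repeatedly
    data = np.fromiter(iter_func(), dtype=dtype, count=num_rows * num_cols)
    return data.reshape((num_rows, num_cols))
```

The function name and signature here are illustrative only.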


This solution works best for numerical data, as a string containing a comma could really complicate things.
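Putting the pieces together, a minimal two-pass sketch (my own wording of the approach above, assuming purely numeric data and a uniform column count) would be:

```python
import numpy as np

def load_csv(filename, delimiter=","):
    """Two-pass load: count rows/columns first, then fill a preallocated array."""
    # Pass 1: take the column count from the first line, then count rows
    with open(filename) as f:
        num_cols = len(next(f).split(delimiter))
        num_rows = 1 + sum(1 for _ in f)

    # Pass 2: parse each line directly into the preallocated array
    data = np.empty((num_rows, num_cols))
    with open(filename) as f:
        for row, line in enumerate(f):
            data[row] = [float(x) for x in line.rstrip().split(delimiter)]
    return data
```

Parsing with float()/split() here sidesteps np.fromstring, whose text mode is deprecated in recent NumPy releases.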
