Python out of memory on large CSV file (numpy)


Problem Description

I have a 3 GB CSV file that I try to read with Python; I need the column-wise medians.

from numpy import * 
def data():
    return genfromtxt('All.csv',delimiter=',')

data = data() # This is where it fails already.

med = zeros(len(data[0]))
data = data.T
for i in xrange(len(data)):
    m = median(data[i])
    med[i] = 1.0/float(m)
print med

The error that I get is this:

Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "Normalize.py", line 40, in <module>
    data = data()
  File "Normalize.py", line 39, in data
    return genfromtxt('All.csv',delimiter=',')
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 1495, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError

I think it's just an out-of-memory error. I am running 64-bit Mac OS X with 4 GB of RAM, and both numpy and Python are compiled in 64-bit mode.

How do I fix this? Should I try a distributed approach, just for the memory management?

Thanks

Also tried with this, but no luck:

genfromtxt('All.csv',delimiter=',', dtype=float16)

Recommended Answer

As other folks have mentioned, for a really large file you're better off iterating.
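For the column-wise medians the question asks about, one memory-light way to iterate is to load a single column at a time with loadtxt's usecols argument. This is a minimal sketch of my own (not part of the original answer): it re-reads the file once per column, trading speed for peak memory, and it assumes the comma-delimited All.csv from the question.

import numpy as np

def column_medians(filename, delimiter=','):
    # Count the columns from the first line of the file.
    with open(filename, 'r') as f:
        ncols = len(f.readline().split(delimiter))
    meds = np.zeros(ncols)
    for i in range(ncols):
        # Load only column i, so peak memory is one column, not the whole table.
        col = np.loadtxt(filename, delimiter=delimiter, usecols=[i])
        meds[i] = np.median(col)
    return meds

med = column_medians('All.csv')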

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows, all of the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if you have a header on the file.)

As a quick comparison, loading a ~500 MB text file with loadtxt uses ~900 MB of RAM at peak, while loading the same file with genfromtxt uses ~2.5 GB.

[Memory profile: loadtxt]

[Memory profile: genfromtxt]
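If you want to reproduce this kind of comparison yourself, one simple check (my own suggestion, not part of the original answer) is the process's peak resident set size after loading; note that ru_maxrss is reported in bytes on Mac OS X but in kilobytes on Linux.

import resource
import numpy as np

data = np.loadtxt('large_text_file.csv', delimiter=',')

# Peak resident set size of this process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: %d (bytes on Mac OS X, KB on Linux)' % peak)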

Alternately, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking; if your data is very simple and regular, you can improve on them greatly.)

import numpy as np

def generate_text_file(length=1e6, ncols=20):
    # Write a large, simple CSV of random floats for testing.
    data = np.random.random((int(length), ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many fields the last row had so we can reshape below.
        iter_loadtxt.rowlength = len(line)

    # fromiter builds a flat 1-D array straight from the generator,
    # avoiding the intermediate Python objects that loadtxt/genfromtxt create.
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')

[Memory profile: fromiter]
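To get the column-wise medians the question asks for, you can then reduce the loaded array along axis 0 (file name taken from the question):

data = iter_loadtxt('All.csv')
med = np.median(data, axis=0)  # one median per column
print(med)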
