Python out of memory on large CSV file (numpy)


Question

I have a 3GB CSV file that I am trying to read with Python, and I need the median of each column.

from numpy import * 
def data():
    return genfromtxt('All.csv',delimiter=',')

data = data() # This is where it fails already.

med = zeros(len(data[0]))
data = data.T
for i in xrange(len(data)):
    m = median(data[i])
    med[i] = 1.0/float(m)
print med

The error that I get is this:

Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "Normalize.py", line 40, in <module>
    data = data()
  File "Normalize.py", line 39, in data
    return genfromtxt('All.csv',delimiter=',')
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 1495, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError

I think it's just an out-of-memory error. I am running 64-bit Mac OS X with 4GB of RAM, and both numpy and Python are compiled in 64-bit mode.

How do I fix this? Should I try a distributed approach, just for the memory management?

Thanks

EDIT: I also tried the following, but with no luck...

genfromtxt('All.csv',delimiter=',', dtype=float16)

Solution

As other folks have mentioned, for a really large file, you're better off iterating.

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if the file has a header.)
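
For instance, here is a minimal sketch of that route applied to the question's task. The skiprows=1 is purely illustrative (the question never says whether All.csv has a header line), and the median step replaces the question's explicit loop over columns with one vectorized call:

import numpy as np

# Read the whole CSV into a single 2-D float array.
# skiprows=1 assumes one header line -- drop it if the file starts with data.
data = np.loadtxt('All.csv', delimiter=',', skiprows=1)

# Column-wise medians, then their reciprocals, as in the question's loop.
med = 1.0 / np.median(data, axis=0)
print(med)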

As a quick comparison, loading a ~500MB text file with loadtxt uses ~900MB of RAM at peak, while loading the same file with genfromtxt uses ~2.5GB.

[Figure: memory usage over time while loading with loadtxt]

[Figure: memory usage over time while loading with genfromtxt]
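
Those figures came from memory profiling. A rough way to reproduce that kind of peak-usage comparison yourself, assuming the third-party memory_profiler package is installed (it is not part of numpy, and 'yourfile.txt' is just the placeholder name used above), would be:

import numpy as np
from memory_profiler import memory_usage  # pip install memory_profiler

def peak_mib(loader):
    # Run the loader in this process, sampling memory while it works;
    # memory_usage returns a list of readings in MiB, so max() is the peak.
    samples = memory_usage((loader, ('yourfile.txt',), {'delimiter': ','}))
    return max(samples)

print('loadtxt peak:    %.0f MiB' % peak_mib(np.loadtxt))
print('genfromtxt peak: %.0f MiB' % peak_mib(np.genfromtxt))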

Alternatively, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking; if your data is very simple and regular, you can improve on them greatly.)

import numpy as np

def generate_text_file(length=1e6, ncols=20):
    # Write a large random CSV to disk for testing.
    data = np.random.random((int(length), ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    # Yield one parsed value at a time; fromiter consumes the
                    # generator directly, so no list of rows is ever held.
                    yield dtype(item)
        # Remember how many columns the last line had so the flat array
        # can be reshaped into rows afterwards.
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')

[Figure: memory usage over time while loading with fromiter]
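
One usage note, tying back to the dtype=float16 attempt in the question: the dtype argument of iter_loadtxt is used both to parse each field and as the dtype that np.fromiter builds, so pointing it at a narrower float shrinks the final array accordingly. A small sketch, reusing the definitions above and assuming single precision is acceptable:

# Parse each field as float32 instead of the default float64; the final
# array then takes roughly half the memory.
data = iter_loadtxt('All.csv', dtype=np.float32)

From there, the column-wise statistics can be computed exactly as in the loadtxt sketch above.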
