Python MemoryError or ValueError in np.loadtxt and iter_loadtxt


Question


My starting point was a problem with NumPy's function loadtxt:

X = np.loadtxt(filename, delimiter=",")

that gave a MemoryError in np.loadtxt(..). I googled it and came to this question on StackOverflow. That gave the following solution:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')

So I tried that, but then encountered the following error message:

> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged
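That ValueError is what reshape raises whenever the total number of elements is not divisible by the requested row length, which is exactly what happens when lines contribute different numbers of items. A minimal reproduction with made-up sizes (the exact error message wording varies between NumPy versions):

```python
import numpy as np

# 10 elements cannot form rows of length 3, so reshape refuses:
flat = np.arange(10, dtype=float)
try:
    data = flat.reshape((-1, 3))
except ValueError:
    data = None  # the total size of the new array must be unchanged

# With a compatible row length it succeeds: 10 elements -> 2 rows of 5.
ok = flat.reshape((-1, 5))
```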

Then I tried to determine the number of rows and the maximum number of columns up front, with the code fragments below, which I partly got from another question and partly wrote myself:

num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    #didn't change

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))

But this still gave the same error message, though I thought it should have been solved. On the other hand, I'm not sure whether I'm calling data.reshape(..) in the correct manner.
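One subtlety worth noting about the fragment above (my observation, not part of the original question): np.fromiter's count argument is measured in scalar items consumed from the iterator, not in rows. Since iter_func yields one float at a time, count=num_rows stops after num_rows floats, which by itself makes the later reshape fail. A sketch with toy data:

```python
import numpy as np

num_rows, max_cols = 2, 3

def iter_func():
    # Yields one scalar at a time, like the generator in the question.
    for row in [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]:
        for item in row:
            yield item

# count is in scalar items, so it must be num_rows * max_cols, not num_rows:
flat = np.fromiter(iter_func(), dtype=float, count=num_rows * max_cols)
data = flat.reshape((num_rows, max_cols))
```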

I commented out the line where data.reshape(..) is called to see what would happen. That gave this error message:

> ValueError: need more than 1 value to unpack

Which happened at the first point where something is done with X, the variable this whole problem is about.
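That unpack error is consistent with X still being one-dimensional: without the reshape, the array's .shape has a single element, so any downstream line of the form rows, cols = X.shape (a hypothetical but common pattern; the exact line depends on the open source code) gets one value where it needs two. A sketch:

```python
import numpy as np

X = np.zeros(12)  # 1-D, as np.fromiter leaves it when the reshape is skipped
try:
    rows, cols = X.shape  # shape is (12,): one value, two targets
    unpacked = True
except ValueError:
    unpacked = False

# After reshaping, the same unpack works:
rows, cols = X.reshape((-1, 4)).shape
```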

I know this code can work on the input files I got, because I saw it in use with them. But I can't find out why I can't solve this problem. My best guess is that, because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For info: I have 8GB of RAM and a 1.2GB file, but my RAM is not full according to Task Manager.

What I want to solve is that I'm using open source code that needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but within my memory. I know the code originally worked on Mac OS X and Linux, and to be more precise: "MacOsx 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."

I don't care that much about time. My file contains roughly 200,000 lines with 100 or 1000 items per line (depending on the input file: one kind always has 100, the other always 1000). Each item is a floating point number with 3 decimals, possibly negative, and the items are separated by a comma and a space. E.g.: [..] 0.194, -0.007, 0.004, 0.243, [..], i.e. 100 or 1000 of those items per line (you see 4 of them here), for roughly 200,000 lines.

I'm using Python 2.7 because the open source code needs that.

Does anyone have a solution for this? Thanks in advance.

Solution

On Windows a 32 bit process is only given a maximum of 2GB (or GiB?) memory and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.
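A quick way to confirm which Python build is actually running (a small check of my own, not from the original answer):

```python
import struct
import sys

# Pointer size in bits: 32 for a 32-bit build, 64 for a 64-bit build.
bits = struct.calcsize("P") * 8
print(bits, sys.maxsize)
```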

The second problem you appear to be facing is that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example:

import numpy as np

delimiter = ','  # the files use a comma (plus a space) between items

numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100
print np.where(numbers_per_line != expected_number)
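If that check does reveal short rows, one possible workaround (my sketch, not part of the original answer; the expected_cols value and the NaN padding are assumptions about what the caller wants) is to pad each short row inside the generator, so every row contributes the same number of items and the reshape holds:

```python
import numpy as np

def iter_loadtxt_padded(filename, delimiter=',', expected_cols=100, dtype=float):
    # Like iter_loadtxt, but pads short rows with NaN so the reshape succeeds.
    def iter_func():
        with open(filename, 'r') as infile:
            for line in infile:
                items = line.rstrip().split(delimiter)
                for item in items:
                    yield dtype(item)
                # Fill in missing trailing values for short rows.
                for _ in range(expected_cols - len(items)):
                    yield np.nan

    data = np.fromiter(iter_func(), dtype=dtype)
    return data.reshape((-1, expected_cols))
```

The padded cells come back as NaN, so downstream code can detect them with np.isnan and decide how to handle the missing data.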
