Processing a very very big data set in python - memory error

Question

I'm trying to process data obtained from a CSV file using the csv module in Python. There are about 50 columns and 401125 rows. I used the following code chunk to put the data into a list:

import csv

# Read the whole CSV into a plain Python list of rows
csv_file_object = csv.reader(open(r'some_path\Train.csv', 'rb'))
header = csv_file_object.next()  # first row holds the column names (Python 2 iterator API)
data = []
for row in csv_file_object:
    data.append(row)

I can get the length of this list using len(data), and it returns 401125. I can even get each individual record by calling list indices. But when I try to get the size of the list by calling np.size(data) (I imported numpy as np), I get the following stack trace:

MemoryError                               Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 np.size(data)

C:\Python27\lib\site-packages\numpy\core\fromnumeric.pyc in size(a, axis)
   2198             return a.size
   2199         except AttributeError:
-> 2200             return asarray(a).size
   2201     else:
   2202         try:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    233
    234     """
--> 235     return array(a, dtype, copy=False, order=order)
    236
    237 def asanyarray(a, dtype=None, order=None):

MemoryError:

I can't even divide that list into multiple parts using list indices or convert it into a numpy array. It gives the same memory error.

How can I deal with this kind of big data sample? Is there any other way to process large data sets like this one?

I'm using the IPython notebook on Windows 7 Professional.

Answer

As noted by @DSM in the comments, the reason you're getting a memory error is that calling np.size on a list will copy the data into an array first and then get the size.

If you don't need to work with it as a numpy array, just don't call np.size. If you do want numpy-like indexing options and so on, you have a few options.
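
For example, the same numbers can be read off the list itself without copying anything (a minimal sketch; data is the list of rows built in the question):

num_rows = len(data)               # 401125
num_cols = len(data[0])            # about 50
total_cells = num_rows * num_cols  # what np.size(data) would have returned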

You could use pandas, which is meant for handling big not-necessarily-numerical datasets and has some great helpers and stuff for doing so.
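
A minimal sketch of that route (the chunksize value and the process() helper are illustrative assumptions, not pandas requirements):

import pandas as pd

df = pd.read_csv(r'some_path\Train.csv')  # parses straight into a DataFrame
print df.shape  # (rows, columns) without building an intermediate list first

# If the whole file won't fit comfortably in memory, read it in pieces:
for chunk in pd.read_csv(r'some_path\Train.csv', chunksize=50000):
    process(chunk)  # hypothetical per-chunk handler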

If you don't want to do that, you could define a numpy structured array and populate it line by line in the first place, rather than making a list and copying into it. Something like:

import csv
import numpy as np

# One (name, dtype) pair per column; string columns need a fixed width
# (plain str gives a zero-length field), e.g. 'S32' as a guess here
fields = [('name1', 'S32'), ('name2', float), ...]
data = np.zeros((num_rows,), dtype=fields)  # num_rows is known up front (401125 here)

csv_file_object = csv.reader(open(r'some_path\Train.csv', 'rb'))
header = csv_file_object.next()
for i, row in enumerate(csv_file_object):
    data[i] = tuple(row)  # structured-array rows are assigned as tuples, not lists

You could also define fields based on header so you don't have to manually type out all 50 column names, though you'd have to do something about specifying the data types for each.
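
A hedged sketch of that idea, assuming a hypothetical dtypes mapping that you would fill in for the non-string columns (anything not listed falls back to a fixed-width string, whose width here is a guess):

dtypes = {'name2': float}  # hypothetical: only the non-string columns
fields = [(name, dtypes.get(name, 'S32')) for name in header]
data = np.zeros((num_rows,), dtype=fields)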
