File data to array is using a lot of memory

Question

I'm taking a large text file with tab-separated values and adding them to an array.

When I run my code on a 32 MB file, Python's memory consumption goes through the roof, using around 500 MB of RAM.

I need to be able to run this code for a 2 GB file, and possibly even larger files.

My current code is:

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            markers.append(line.split('\t'))

parseZeroIndex()

Running this code against my 2 GB file is not possible as is. The files look like this:

per1    1029292 string1 euqye
per1    1029292 string2 euqys

My questions are:

What is using all this memory?

What is a more efficient way to do this, memory-wise?

Answer

"What is using all this memory?"

There's overhead for Python objects. See how many bytes some strings actually take:

Python 2:

>>> import sys
>>> map(sys.getsizeof, ('', 'a', u'ä'))
[21, 22, 28]

Python 3:

>>> import sys
>>> list(map(sys.getsizeof, ('', 'a', 'ä')))
[25, 26, 38]


"What is a more efficient way to do this, memory-wise?"

In comments you said there are lots of duplicate values, so string interning (storing only one copy of each distinct string value) might help a lot. Try this:

Python 2:

            markers.append(map(intern, line.rstrip().split('\t')))

Python 3:

            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

Note I also used line.rstrip() to remove the trailing \n from the line.
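
Putting it together, the whole function might look like this with the Python 3 change applied (a sketch; note that sys has to be imported for sys.intern to be available):

import sys

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # rstrip() drops the trailing newline; sys.intern stores each
            # distinct field string only once, so duplicates share one object.
            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

parseZeroIndex()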

Experiment

I tried

>>> x = [str(i % 1000) for i in range(10**7)]

>>> import sys
>>> x = [sys.intern(str(i % 1000)) for i in range(10**7)]

in Python 3. The first one takes 355 MB (looking at the process in Windows Task Manager). The second one takes only 47 MB. Furthermore:

>>> sys.getsizeof(x)
40764032
>>> sum(map(sys.getsizeof, x[:1000]))
27890

So 40 MB is for the list referencing the strings (no surprise, as there are ten million references of four bytes each). And the strings themselves total only 27 KB.

Further improvements

As seen in the experiment, much of your RAM usage might come not from the strings but from your list objects: both the markers list itself and all those list objects representing your rows. Especially if you're using 64-bit Python, which I suspect you are.

To reduce that overhead, you could use tuples instead of lists for your rows, as they're more lightweight:

>>> sys.getsizeof(['a', 'b', 'c'])
48
>>> sys.getsizeof(('a', 'b', 'c'))
40

I estimate your 2 GB file has 80 million rows, so switching to tuples would save 640 MB of RAM (8 bytes saved per row, as the getsizeof numbers above show). Perhaps more if you run 64-bit Python.
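
Combined with the interning from above, the append line could become (a Python 3 sketch):

            markers.append(tuple(map(sys.intern, line.rstrip().split('\t'))))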

Another idea: if all your rows have the same number of values (I assume three), you could ditch those 80 million row list objects and use a one-dimensional list of the 240 million string values instead. You'd just access it with markers[3*i+j] instead of markers[i][j], as sketched below. That could save a few GB of RAM.
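
Here's a minimal sketch of that flat layout in Python 3 (FIELDS is an assumed constant; adjust it to however many values your rows really have):

import sys

FIELDS = 3  # assumed number of values per row

markers = []

def parseZeroIndexFlat():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            # extend() adds the fields to one flat list, so no
            # per-row list object is created at all.
            markers.extend(map(sys.intern, line.rstrip().split('\t')))

parseZeroIndexFlat()

# Field j of row i is now markers[FIELDS*i + j] instead of markers[i][j].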
