numpy loadtxt function seems to be consuming too much memory

Problem description

When I load an array using numpy.loadtxt, it seems to take too much memory. E.g.

a = numpy.zeros(int(1e6))

causes an increase of about 8 MB in memory (using htop, or just 8 bytes × 1 million ≈ 8 MB). On the other hand, if I save and then load this array

numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')

my memory usage increases by about 100 MB! Again I observed this with htop, both in the IPython shell and while stepping through the code with Pdb++.

Any idea what's going on?

After reading jozzas's answer, I realized that if I know the array size ahead of time, there is a much more memory-efficient way to do this. Say 'a' is an m×n array:

import csv
import numpy

b = numpy.zeros((m, n))  # m and n are known ahead of time
with open('a.csv', 'r') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        b[i, :] = numpy.array(row)  # each row is parsed straight into b
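
If the dimensions aren't known ahead of time, a cheap first pass over the file can recover them before preallocating. This is a minimal sketch, assuming the same comma-separated layout as the snippet above:

import csv
import numpy

# First pass: find m and n without holding any row data in memory
with open('a.csv', 'r') as f:
    reader = csv.reader(f)
    n = len(next(reader))           # column count, from the first row
    m = 1 + sum(1 for _ in reader)  # remaining row count

# Second pass: parse each row straight into the preallocated array
b = numpy.zeros((m, n))
with open('a.csv', 'r') as f:
    for i, row in enumerate(csv.reader(f)):
        b[i, :] = numpy.array(row)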

Answer

Saving this array of floats to a text file creates a 24M text file. When you re-load this, numpy goes through the file line-by-line, parsing the text and recreating the objects.

I would expect memory usage to spike during this time, as numpy doesn't know how big the resultant array needs to be until it gets to the end of the file, so I'd expect there to be at least 24M + 8M + other temporary memory used.
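
As a rough sanity check on those numbers, a minimal sketch (the file size assumes savetxt's default '%.18e' format, which writes roughly 25 characters per value):

import os
import numpy

a = numpy.zeros(int(1e6))
numpy.savetxt('a.csv', a)

print(a.nbytes)                  # 8000000 bytes: 1e6 float64 values * 8 bytes each
print(os.path.getsize('a.csv'))  # ~25000000 bytes: one ~25-character line per value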

Here's the relevant bit of the numpy code, from /lib/npyio.py:

    # Parse each line, including the first
    for i, line in enumerate(itertools.chain([first_line], fh)):
        vals = split_line(line)
        if len(vals) == 0:
            continue
        if usecols:
            vals = [vals[i] for i in usecols]
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

    #...A bit further on
    X = np.array(X, dtype)

This additional memory usage shouldn't be a concern, as this is just the way Python works: while your Python process appears to be using 100 MB of memory, internally it keeps track of which items are no longer used and will reuse that memory. For example, if you were to re-run this save-load procedure in the same program (save, load, save, load), your memory usage would not increase to 200 MB.
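
To see this in action, a minimal sketch (assuming a Unix system, where the standard-library resource module reports the process's peak RSS, in kilobytes on Linux):

import resource
import numpy

a = numpy.zeros(int(1e6))

for i in range(3):
    numpy.savetxt('a.csv', a)
    b = numpy.loadtxt('a.csv')
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Peak RSS stays roughly flat after the first round: the memory
    # freed by each load is reused rather than the process growing.
    print("round %d: peak RSS ~%d MB" % (i + 1, peak_kb // 1024))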
