Memory error utilizing numpy arrays (Python)

Question

My original list_ array has over 2 million rows, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could get around it? The list_ shown below is a portion of the actual numpy array.

Pandas data:

import pandas as pd
import math
import numpy as np
bigdata = 'input.csv'
# the original passed an undefined name (Daily_url); bigdata is assumed to be the intended path
data = pd.read_csv(bigdata, low_memory=False)
# reverse the row order of the table
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])

Code:

number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

std = np.std(rolling_window(list_, number), axis=1)

Error message: MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64

Full error message:

MemoryError                               Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
      5     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
      6 
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)

<__array_function__ internals> in std(*args, **kwargs)

C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
   3495 
   3496     return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497                          **kwargs)
   3498 
   3499 

C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
    232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
    233     ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234                keepdims=keepdims)
    235 
    236     if isinstance(ret, mu.ndarray):

C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
    200     # Note that x may not be inexact and that we need it to be an array,
    201     # not a scalar.
--> 202     x = asanyarray(arr - arrmean)
    203 
    204     if issubclass(arr.dtype.type, (nt.floating, nt.integer)):

MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64

Recommended answer

Generally, there are two ways to deal with "cannot allocate 198 GiB of memory":

  • Process the data in chunks, or line by line.

Your algorithm appears to be suitable for this. Rather than reading the data all at once, rewrite the rolling_window function so that it loads the initial window (the first n lines of the file), then repeatedly drops the oldest line and reads the next line from the file. That way, you never hold more than n lines in memory, and the 198 GiB allocation never happens.
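
The line-by-line idea above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the function name rolling_std_stream is hypothetical, and it assumes the input yields one numeric value per line. It uses the population standard deviation (ddof=0), which is what np.std computes by default.

```python
import math
from collections import deque


def rolling_std_stream(lines, window):
    """Yield the population std (ddof=0) of each full window of values.

    `lines` is any iterable of strings holding one number each
    (an assumption for this sketch), such as an open text file.
    """
    buf = deque(maxlen=window)  # never holds more than `window` values
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        buf.append(float(line))  # a full deque drops its oldest value here
        if len(buf) == window:
            mean = math.fsum(buf) / window
            var = math.fsum((x - mean) ** 2 for x in buf) / window
            yield math.sqrt(var)
```

It could be used as `with open("close.txt") as f: for s in rolling_std_stream(f, 5): ...`, where close.txt is a hypothetical file name. Memory use stays proportional to the window size, but each window is recomputed from scratch, so in pure Python this is likely to be slow for a window of 10000 over millions of rows.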

If it's a local file, it can be kept open during the whole calculation, which is the easiest case. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or for each additional chunk, which you buffer locally).

  • Alternatively, buy (or rent) a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off the shelf at AWS (and presumably GCP and Azure, or for direct purchase).

This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It saves you from rewriting your code, but it's not a sustainable solution in the longer term.
