Time series data preprocessing - numpy strides trick to save memory


Problem description

I am preprocessing a time series dataset, changing its shape from 2 dimensions (datapoints, features) to 3 dimensions (datapoints, time_window, features).

In this context, the time window (sometimes also called look-back) is the number of previous time steps/datapoints used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into consideration for a single prediction in the future.

The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage: consecutive windows overlap, so the same rows are stored again and again and the input data becomes very heavy.

This is the function that I have been using so far to reshape the input data into a 3-dimensional structure.

import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """It transforms a 2d dataset into 3d based on a specific size;
    original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1],)).astype(np.float32)

    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
        print(framed_data.shape)

    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty quality test to check if the data has been correctly reshaped        
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

I have been advised to use numpy's strides trick to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on this subject focuses on applying the trick to a 2-dimensional array, as in this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I came up with; however, it neither reduces the size of framed_data nor frames the data correctly, since it does not pass the quality test.

I am quite sure that my error lies in the strides parameter, which I did not fully understand. The new_strides below are the only values I managed to feed to as_strided successfully.

from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):

    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0]*data_to_frame.shape[1] ,
                   data_to_frame.strides[0]*window_size)

    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame, 
                             shape=(n_datapoints, # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),                   
                                    strides=new_strides).astype(np.float32)
    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty test to check if the data has been correctly reshaped        
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

Any help would be highly appreciated!

Recommended answer

For this X:

In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)

This as_strided call produces the same array as your time_framer:

In [736]: np.lib.stride_tricks.as_strided(X, 
            shape=(X.shape[0]-3, 3, X.shape[1]), 
            strides=(24, 24, 8))
Out[736]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]],

       [[ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[12, 13, 14],
        [15, 16, 17],
        [18, 19, 20]]])

It strides the last dimension just like X, and the second-to-last dimension as well. The first dimension advances one row per window, so it too gets X.strides[0]. So the window size only affects the shape, not the strides.

So in your as_strided version, just use:

 new_strides = (data_to_frame.strides[0],
                data_to_frame.strides[0] ,
                data_to_frame.strides[1])
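Putting the corrected strides into a full function might look like the following sketch. The name strided_time_framer is hypothetical, the dtype conversion from the question is deliberately left out (it would copy the data), and writeable=False is an extra precaution that is not part of the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def strided_time_framer(data_to_frame, window_size=2):
    """Return a (n_windows, window_size, n_features) view of a 2d array.

    Hypothetical helper name. No .astype() is applied here, because a
    dtype conversion would copy the data and undo the memory saving.
    """
    row_stride, col_stride = data_to_frame.strides
    # same window count as the question's time_framer
    n_datapoints = data_to_frame.shape[0] - window_size
    return as_strided(
        data_to_frame,
        shape=(n_datapoints, window_size, data_to_frame.shape[1]),
        # next window: one row down; next step inside a window: one row
        # down; next feature: one column across
        strides=(row_stride, row_stride, col_stride),
        writeable=False,  # windows overlap, so writing through the view is unsafe
    )


X = np.arange(24, dtype=np.float32).reshape(8, 3)
framed = strided_time_framer(X, window_size=3)

# reference result built from explicit slices, as time_framer does
expected = np.stack([X[i:i + 3] for i in range(X.shape[0] - 3)])

print(framed.shape)
print(np.array_equal(framed, expected))  # same values as the copying version
print(np.shares_memory(framed, X))       # no new data buffer was allocated
```

Comparing against the slice-based result is a stricter check than the quick test in the question, since it covers every window rather than only the first two.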






A minor correction: set the default window size to 2 or larger. A window size of 1 produces an indexing error in the test:

framed_data[0,1]==framed_data[1,0]






Looking at getsizeof:

In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192

Wait, why is the size of X smaller than its nbytes? Because X is a view (see line [734] above): reshape returns a view of the arange buffer, and getsizeof only counts the data buffer when the array owns it.

In [756]: sys.getsizeof(X.copy())
Out[756]: 304

As noted in another SO question, getsizeof has to be used with caution:

Why is the size of a numpy array different?
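The difference between a view and an owning array can be checked directly. This is a small sketch, assuming an explicit int64 dtype so that nbytes does not depend on the platform; the exact header sizes reported by getsizeof vary between numpy versions, so no specific numbers are asserted.

```python
import sys
import numpy as np

X = np.arange(24, dtype=np.int64).reshape(8, 3)  # reshape returns a view
X_copy = X.copy()                                # the copy owns its buffer

print(X.flags.owndata, X_copy.flags.owndata)

# For the view, getsizeof reports only the array object itself,
# so it comes out smaller than the data it points at.
print(sys.getsizeof(X), X.nbytes)

# For the copy, the data buffer is included in the reported size.
print(sys.getsizeof(X_copy), X_copy.nbytes)
```

For judging how much data an array addresses, nbytes is the more predictable number; for judging whether that data is shared, flags.owndata and np.shares_memory are more direct than getsizeof.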

Now for the expanded copy:

In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512

And the strided version:

In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512

The size of x1 is that of a view (128 because it is 3d). But if we try to change its dtype, numpy makes a copy, and the strides and size are then the same as for x2.
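The same thing happens with the .astype(np.float32) call at the end of the question's strides_trick_time_framer: it turns the view into a full copy. The sketch below shows the effect, and one possible workaround (an assumption on my part, not something the answer spells out): convert the 2d array first and build the view afterwards.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float64).reshape(8, 3)
window_size = 4
n_windows = X.shape[0] - window_size
shape = (n_windows, window_size, X.shape[1])

view = as_strided(X, shape=shape,
                  strides=(X.strides[0], X.strides[0], X.strides[1]),
                  writeable=False)

# Changing the dtype after striding materializes every window.
converted = view.astype(np.float32)
print(np.shares_memory(view, X))       # the view reuses X's buffer
print(np.shares_memory(converted, X))  # the converted array does not
print(converted.flags.owndata, converted.nbytes)

# Converting the small 2d array first keeps the saving.
X32 = X.astype(np.float32)
view32 = as_strided(X32, shape=shape,
                    strides=(X32.strides[0], X32.strides[0], X32.strides[1]),
                    writeable=False)
print(view32.dtype, np.shares_memory(view32, X32))
```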

Many operations on x1 will lose the strided size advantage: x1.ravel(), x1+1, etc. It is mainly reduction operations like mean and sum that produce a real space saving.
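As a rough illustration of that last point, the sketch below compares a reduction over the window axis with an element-wise operation and with ravel. The array sizes are toy values chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float64).reshape(8, 3)
window_size = 4
n_windows = X.shape[0] - window_size
x1 = as_strided(X, shape=(n_windows, window_size, X.shape[1]),
                strides=(X.strides[0], X.strides[0], X.strides[1]),
                writeable=False)

# Reduction over the window axis: the result has one row per window.
window_means = x1.mean(axis=1)
print(window_means.shape, window_means.nbytes)

# Element-wise arithmetic allocates a full-size result.
shifted = x1 + 1
print(shifted.shape, shifted.nbytes, np.shares_memory(shifted, X))

# ravel cannot describe the overlapping windows as a flat view, so it copies.
flat = x1.ravel()
print(flat.nbytes, np.shares_memory(flat, X))
```

Here the reduced array is smaller than X itself, while the other two results are as large as the fully expanded copy produced by time_framer.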
