Time series data preprocessing - numpy strides trick to save memory
Question
I am preprocessing a time series dataset, changing its shape from 2 dimensions (datapoints, features) to 3 dimensions (datapoints, time_window, features).
From this perspective, the time window (sometimes also called look back) indicates the number of previous time steps/datapoints that are used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into consideration for a single prediction in the future.
The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage: it duplicates data across the windows, which makes the input data very heavy.
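To put a rough number on that redundancy, here is a back-of-the-envelope sketch. The dataset dimensions are hypothetical and chosen only for illustration; they do not come from my actual data.

```python
import numpy as np

# Hypothetical dataset dimensions, chosen only for illustration.
n_rows, n_features, window_size = 100_000, 20, 50
itemsize = np.dtype(np.float32).itemsize  # bytes per float32 element

# Size of the original 2D array versus the fully copied 3D array.
flat_bytes = n_rows * n_features * itemsize
framed_bytes = (n_rows - window_size) * window_size * n_features * itemsize

print(flat_bytes / 10 ** 6)       # 2D input size in MB
print(framed_bytes / 10 ** 6)     # copied 3D output size in MB
print(framed_bytes / flat_bytes)  # close to window_size
```

The copied 3D array grows by roughly a factor of window_size, because each row is stored once per window it appears in.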
This is the function that I have been using so far to reshape the input data into a 3-dimensional structure.
import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """It transforms a 2d dataset into 3d based on a specific size;
    original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1],)).astype(np.float32)
    # copy each window of consecutive rows into its own slot
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)

    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty quality test to check if the data has been correctly reshaped
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')

    return framed_data
I have been advised to use numpy's strides trick to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on this subject focuses on implementing the trick on a 2-dimensional array, just as this excellent tutorial does. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I have come up with; however, it neither reduces the size of framed_data nor frames the data correctly, as it does not pass the quality test.
I am quite sure that my error is in the strides parameter, which I do not fully understand. The new_strides below are the only values I managed to feed to as_strided successfully.
from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):
    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0]*data_to_frame.shape[1],
                   data_to_frame.strides[0]*window_size)

    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame,
                             shape=(n_datapoints,  # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),
                             strides=new_strides).astype(np.float32)

    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty test to check if the data has been correctly reshaped
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')

    return framed_data
Any help would be greatly appreciated!
Recommended answer
For this X:
In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)
this as_strided call produces the same array as your time_framer:
In [736]: np.lib.stride_tricks.as_strided(X,
shape=(X.shape[0]-3, 3, X.shape[1]),
strides=(24, 24, 8))
Out[736]:
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]],
[[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[12, 13, 14],
[15, 16, 17],
[18, 19, 20]]])
It strides the last dimension just like X, and the second-to-last as well. The first dimension advances one row at a time, so it too gets X.strides[0]. So the window size only affects the shape, not the strides.
So in your as_strided version just use:
new_strides = (data_to_frame.strides[0],
data_to_frame.strides[0] ,
data_to_frame.strides[1])
A minor correction: set the default window size to 2 or larger. A window size of 1 produces an indexing error in the test:
framed_data[0,1]==framed_data[1,0]
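Putting those pieces together, a minimal sketch of the corrected function might look like the following. It assumes a 2D input, drops the `.astype(np.float32)` call (which would force a full copy of the strided result, so any conversion is better done on the 2D input beforehand), and passes `writeable=False` as a precaution, since the windows overlap in memory. These choices are suggestions layered on top of the stride fix, not part of the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=2):
    """Return a read-only 3D sliding-window view of a 2D array (no copy)."""
    row_stride, col_stride = data_to_frame.strides
    n_datapoints = data_to_frame.shape[0] - window_size
    return as_strided(
        data_to_frame,
        shape=(n_datapoints, window_size, data_to_frame.shape[1]),
        # window index and position inside the window both step by one row
        strides=(row_stride, row_stride, col_stride),
        writeable=False,  # windows overlap in memory, so guard against writes
    )

X = np.arange(24).reshape(8, 3)
framed = strides_trick_time_framer(X, window_size=3)
print(framed.shape)                                # (5, 3, 3)
print(np.array_equal(framed[0, 1], framed[1, 0]))  # True
print(np.shares_memory(framed, X))                 # True
```

The last line checks that the result still shares its buffer with X, which is the property that keeps memory usage low.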
Looking at getsizeof:
In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192
Wait, why is X's size smaller than nbytes? Because it is a view (see line [734] above).
In [756]: sys.getsizeof(X.copy())
Out[756]: 304
As noted in another SO question, getsizeof has to be used with caution.
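A small sketch of why the two measures disagree: nbytes counts the elements an array exposes, while getsizeof only includes the data buffer when the array owns it. The exact byte counts depend on the NumPy version and platform, so the example compares values instead of printing them.

```python
import sys
import numpy as np

X = np.arange(24).reshape(8, 3)  # reshape returns a view of the arange buffer
Xc = X.copy()                    # the copy owns its data

print(X.base is not None, Xc.base is None)   # True True
print(X.nbytes == Xc.nbytes)                 # True: nbytes counts elements either way
print(sys.getsizeof(X) < sys.getsizeof(Xc))  # True: the view's buffer is not counted
```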
Now for the expanded copy:
In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512
And the strided version:
In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512
x1's size is just like that of a view (128 because it is 3d). But if we try to change its dtype, it makes a copy, and the strides and size are the same as x2's.
Many operations on x1 will lose the strided size advantage: x1.ravel(), x1+1, etc. It's mainly reduction operations like mean and sum that produce a real space savings.
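As a rough illustration of that point, the sketch below rebuilds a strided view with the same shape as x1 and uses np.shares_memory to check which results still point at the original buffer. The float64 dtype and the window size of 4 are chosen only for the demo.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float64).reshape(8, 3)
s0, s1 = X.strides
x1 = as_strided(X, shape=(4, 4, 3), strides=(s0, s0, s1), writeable=False)

print(np.shares_memory(x1, X))          # True: still a view
print(np.shares_memory(x1 + 1, X))      # False: arithmetic allocates a full array
print(np.shares_memory(x1.ravel(), X))  # False: ravel has to copy the overlapping windows
print(x1.mean(axis=1).shape)            # (4, 3): the reduction output is small
```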