Proper way to feed time-series data to stateful LSTM?


Question


Let's suppose I have a sequence of integers:

0,1,2, ..

and want to predict the next integer given the last 3 integers, e.g.:

[0,1,2]->3, [3,4,5]->6, etc

Suppose I setup my model like so:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3
model = Sequential()
# stateful=True: cell/hidden state is carried over between batches
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))

It is my understanding that the model has the following structure (please excuse the crude drawing):

First Question: is my understanding correct?

Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).

This brings me to my main question: it seems the standard practice (see, for example, this blog post and the TimeseriesGenerator Keras preprocessing utility) is to feed a staggered set of inputs to the model during training.

For example:

batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
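For concreteness, here is a minimal sketch (assuming TensorFlow's Keras TimeseriesGenerator, which is mentioned above; the snippet itself is only illustrative) of producing such staggered windows from the integer sequence:

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

seq = np.arange(10).reshape(-1, 1)           # 0, 1, ..., 9 as a (10, 1) series
gen = TimeseriesGenerator(seq, seq, length=3, batch_size=1)

for i in range(3):
    x, y = gen[i]                            # x: (1, 3, 1), y: (1, 1)
    print(x.reshape(-1), '->', y.reshape(-1))
# [0 1 2] -> [3]
# [1 2 3] -> [4]
# [2 3 4] -> [5]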

This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:

From the tensorflow docs:

stateful: Boolean (default False). If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.

it seems this "internal" state isn't available and all that is available is the final state. See this figure:

So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:

batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc
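For comparison, a minimal sketch (same assumptions as the snippet above) of producing these non-overlapping windows: with TimeseriesGenerator, setting stride equal to the window length yields them:

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

seq = np.arange(10).reshape(-1, 1)
gen = TimeseriesGenerator(seq, seq, length=3, stride=3, batch_size=1)   # stride == length

for i in range(len(gen)):
    x, y = gen[i]
    print(x.reshape(-1), '->', y.reshape(-1))
# [0 1 2] -> [3]
# [3 4 5] -> [6]
# [6 7 8] -> [9]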

Solution

The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. Whether you do or not will significantly impact learning.


Batch vs. sample mechanism ("see AI" = see "additional info" section)

All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From the model's perspective, data is split into the batch dimension, batch_shape[0], and the feature dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).
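For instance, a toy sketch (array and shapes made up purely to pin down the notation):

import numpy as np

batch = np.zeros((32, 3, 1))     # batch_shape = (32, 3, 1)
print(batch.shape[0])            # 32 -> batch dimension: number of independent samples
print(batch.shape[1:])           # (3, 1) -> feature dimensions: timesteps x channels, per sample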


Overlap vs no-overlap batch

Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?

  • As 240k is too much for an RNN to handle, we use CNN for dimensionality reduction
  • We have the option to use "sliding windows" - i.e. feed a subsegment at a time; let's use 54k

Take 10 samples, shape (240000, 1). How to feed?

  1. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
  2. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...

Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):

  3. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[24000:81000] ...

A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
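A minimal sketch of building such 50%-overlap windows (sliding_windows is a hypothetical helper, not part of the original recipe; shapes follow the EEG example):

import numpy as np

def sliding_windows(sample, window_len, overlap_frac):
    # Hypothetical helper: slice a (timesteps, channels) array into overlapping windows
    step = int(window_len * (1 - overlap_frac))
    starts = range(0, len(sample) - window_len + 1, step)
    return np.stack([sample[s:s + window_len] for s in starts])

sample = np.random.randn(240000, 1)              # one 10-minute EEG recording (fake data)
windows = sliding_windows(sample, 54000, 0.5)    # option (3): 50% overlap
print(windows.shape)                             # (7, 54000, 1)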


Prediction: overlap bad?

If you are doing a one-step prediction, the information landscape is now changed:

  • Chances are, your sequence length is faaar from 240000, so overlaps of any kind don't suffer the "same batch several times" effect
  • Prediction fundamentally differs from classification in that the labels (next timestep) differ for every subsample you feed; classification uses one label for the entire sequence

This dramatically changes your loss function, and what is 'good practice' for minimizing it:

  • A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
  • Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less

What should I do?

First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:

  1. One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
  2. Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit

Your goal: balance the two; 1's main edge over 2 is:

  • 2 can handicap the model by making it forget seen samples
  • 1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly

Should I ever use (2) in prediction?

  • If your sequence lengths are very long and you can afford to "slide window" w/ ~50% its length, maybe, but depends on the nature of data: signals (EEG)? Yes. Stocks, weather? Doubt it.
  • For many-to-many prediction, (2) is more common, largely for longer sequences.

LSTM stateful: may actually be entirely useless for your problem.

Stateful is used when the LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With the former, the idea is that the LSTM considers the earlier part of the sequence in its assessment of the later part:

  • t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
  • seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0

In other words: with stateful, do not overlap across separate batches. Overlap within the same batch is OK because, again, samples are independent - there is no "state" between samples.

When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:

  • Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in the first bullet above.
  • Problem: not straightforward to implement programmatically. You'll need to find a way to feed the LSTM while not applying gradients - e.g. by freezing weights or setting lr = 0; a rough sketch of one variant follows.
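A rough sketch of one way to do it (illustrative only; instead of freezing weights or zeroing lr, this variant warms up the state with predict(), which carries the state forward without a weight update; data and shapes are made up):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(4, batch_input_shape=(1, 50, 1), stateful=True),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

seq = np.arange(100, dtype='float32').reshape(1, 100, 1)
t0, t1 = seq[:, :50], seq[:, 50:]
y1 = np.array([[100.0]], dtype='float32')   # next value after t1, for illustration

model.reset_states()
model.predict(t0)                 # state flows t0 -> t1, no gradient applied
model.train_on_batch(t1, y1)      # gradients computed on t1 only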

When and how does LSTM "pass states" in stateful?

  • When: only batch-to-batch; samples are entirely independent
  • How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because Keras builds batch_size separate LSTM states at compile time

Per above, you cannot do this:

# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]

This implies 21 causally follows 10 - and will wreck training. Instead do:

batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
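A rough sketch (with made-up data and sizes) of feeding this layout to a stateful LSTM, where sample index i in each batch continues sample i from the previous batch:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_seqs, chunk_len, n_chunks = 4, 3, 2
model = Sequential([
    LSTM(4, batch_input_shape=(n_seqs, chunk_len, 1), stateful=True),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

data   = np.random.randn(n_seqs, chunk_len * n_chunks, 1).astype('float32')   # fake data
labels = np.random.randn(n_seqs, n_chunks, 1).astype('float32')               # fake labels

for epoch in range(10):
    model.reset_states()                                   # clear state between full passes
    for c in range(n_chunks):                              # batch c = chunk c of every sequence
        x = data[:, c * chunk_len:(c + 1) * chunk_len]     # shape (4, 3, 1)
        model.train_on_batch(x, labels[:, c])              # state carries from chunk c to c+1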


Batch vs. sample: additional info

A "batch" is a set of samples - 1 or greater (assume always latter for this answer) . Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last SGD also and only distinguish vs BGD - assume it so for this answer.) Differences:

  • SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
  • Above can extend to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
  • First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in between, the model outputs for the latter half will change (see the sketch after this list)
  • The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.
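A small sketch of the third point above (model, data, and seeding are made up purely to show that the two procedures end with different weights):

import numpy as np
import tensorflow as tf

def make_model():
    tf.random.set_seed(0)                       # same seed -> identical initial weights per call
    m = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    m.compile(optimizer=tf.keras.optimizers.SGD(0.1), loss='mse')
    return m

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4)).astype('float32')
y = rng.normal(size=(32, 1)).astype('float32')

m1, m2 = make_model(), make_model()
m1.train_on_batch(X, y)                         # 32 samples, one update
m2.train_on_batch(X[:16], y[:16])               # 16 samples ...
m2.train_on_batch(X[16:], y[16:])               # ... then 16 more, weights updated in between
print(np.allclose(m1.get_weights()[0], m2.get_weights()[0]))   # typically False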

BONUS DIAGRAMS:

