LSTM Autoencoder problems


Problem Description


TLDR:

The autoencoder underfits the timeseries reconstruction and just predicts the average value.

Question Set-up:

Here is a summary of my attempt at a sequence-to-sequence autoencoder. The architecture follows the figure in this paper (image not reproduced here): https://arxiv.org/pdf/1607.00148.pdf

Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.

Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].

Decoder algorithm is as follows for a sequence of length N:

  1. Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
  2. Reconstruct the last element of the sequence: x[N] = w.dot(hs[N]) + b.
  3. Same pattern for the other elements: x[i] = w.dot(hs[i]) + b
  4. Use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]

Minimum Working Example:

Here is my implementation, starting with the encoder:

class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        
        self.lstm = nn.LSTM(
            n_features, 
            latent_size, 
            batch_first=True)
        
    def forward(self, x):
        _, hs = self.lstm(x)
        return hs
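
As a quick sanity check on the shapes (a minimal sketch of my own, using the same toy dimensions as the data below: a batch of one, 14 timesteps, 5 features, latent size 10):

import torch

enc = SeqEncoderLSTM(n_features=5, latent_size=10)
h_n, c_n = enc(torch.randn(1, 14, 5))   # (batch=1, seq_len=14, n_features=5)
print(h_n.shape, c_n.shape)             # both torch.Size([1, 1, 10]): (num_layers, batch, latent_size)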

Decoder class:

class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)
        
    def forward(self, hs_0, seq_len):
        
        x = torch.tensor([])
        
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        
        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        
        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x

Bringing the two together:

class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size

        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)
    
    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)        

And here's my training function:

def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6,  reverse=False):

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
            model.train()
            
        train_loss.append(running_tl / len(trainload))
    
    return train_loss, valid_loss

Data:

Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:

tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
        [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
        [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
        [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
        [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
        [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
        [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
        [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
        [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
        [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
        [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
        [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
        [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
        [0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)

And here is the custom Dataset class:

class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        super().__init__()
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else: 
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        
        self.n_features = n_features
        self.seqs = torch.split(data, window)
        
    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, idx):
        try:    
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")

Problem:

The model only learns the average, no matter how complex I make the model or how long I train it.

Predicted/Reconstruction: (plot not shown; essentially a flat line at the average of each feature)

Actual: (plot not shown; the sequence given in the Data section above)

My research:

This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence

The problem in that case ended up being that the objective function was averaging the target timeseries before calculating the loss. This was due to some broadcasting errors, because the author didn't have the right-sized inputs to the objective function.

In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
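
Concretely, that check amounts to a plain shape assertion before computing the loss (a small sketch with a hypothetical checked_loss helper; x_hat and x are the reconstruction and target, named as in the training function above):

def checked_loss(criterion, x_hat, x):
    # Guard against the silent broadcasting described in the linked question:
    # if the shapes differ but are broadcastable, MSELoss will broadcast rather
    # than raise, and the optimum can collapse toward the average of the target.
    assert x_hat.shape == x.shape, f"shape mismatch: {x_hat.shape} vs {x.shape}"
    return criterion(x_hat, x)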

Other Things I've Tried

  1. I've tried this with varied sequence lengths, from 7 timesteps to 100 timesteps.
  2. I've tried varying the number of variables in the time series, from univariate all the way up to all 274 variables that the data contains.
  3. I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
  4. The paper calls for reconstructing the sequence in reverse order (see the figure referenced above). I have tried this by flipping the original input inside the training loop (after the forward pass but before calculating the loss; the reverse flag in the training function). This makes no difference.
  5. I tried making the model more complex by adding an extra LSTM layer in the encoder.
  6. I've tried playing with the latent size, from 50% to 150% of the number of input features.
  7. I've tried overfitting a single sequence (provided in the Data section above).

Question:

What is causing my model to predict the average and how do I fix it?

Solution

Okay, after some debugging I think I know the reasons.

TLDR

  • You try to predict the next timestep's value instead of the difference between the current timestep and the previous one
  • Your hidden size is too small, making the model unable to fit even a single sample

Analysis

Code used

Let's start with the code (model is the same):

import torch
import seaborn as sns
import matplotlib.pyplot as plt

def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)

    if subtract:
        initial_values = input_tensor[:, 0, :].clone()  # clone, otherwise the in-place subtraction below overwrites it
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor


if __name__ == "__main__":
    torch.manual_seed(0)

    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()
    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break

    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()

What it does:

  • get_data either uses the data you provided (if subtract=False) or, if subtract=True, subtracts the value of the previous timestep from the current one
  • The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity, and increasing it, helps, and what happens when we use differences between timesteps instead of the raw timesteps)

We will only vary HIDDEN_SIZE and SUBTRACT parameters!

NO SUBTRACT, SMALL MODEL

  • HIDDEN_SIZE=5
  • SUBTRACT=False

In this case we get a straight line. The model is unable to fit and capture the phenomena present in the data (hence the flat lines you mentioned).

1000 iterations limit reached

SUBTRACT, SMALL MODEL

  • HIDDEN_SIZE=5
  • SUBTRACT=True

The targets are now far from flat lines, but the model is unable to fit them because its capacity is too small.

1000 iterations limit reached

NO SUBTRACT, LARGER MODEL

  • HIDDEN_SIZE=100
  • SUBTRACT=False

This is a lot better, and our target loss was hit after 942 steps. No more flat lines; model capacity seems quite fine (for this single example!).

SUBTRACT, LARGER MODEL

  • HIDDEN_SIZE=100
  • SUBTRACT=True

Although the graph does not look that pretty, we got to the desired loss after only 215 iterations.

Finally

  • Usually, use differences between timesteps instead of the raw timesteps (or some other transformation; see here for more info about that). Otherwise the neural network will try to simply... copy the output from the previous step (as that's the easiest thing to do). A minimum will be found this way, and escaping it will require more capacity.
  • When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies
  • Use a larger model (for the whole dataset I think you should try something like 300), but you can simply tune that value.
  • Don't use flipud. Use bidirectional LSTMs; that way you get information from both the forward and backward passes of the LSTM (not to be confused with backprop!). This should also boost your score (a minimal sketch follows this list).
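
A minimal sketch of such a bidirectional encoder (my own illustration with a made-up class name, not a drop-in replacement for the classes above; the decoder would then have to accept a state of size 2 * latent_size):

import torch
import torch.nn as nn

class BiSeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super().__init__()
        self.lstm = nn.LSTM(n_features, latent_size, batch_first=True, bidirectional=True)

    def forward(self, x):
        _, (h_n, c_n) = self.lstm(x)
        # h_n / c_n: (num_directions, batch, latent_size) -> concatenate the two directions
        h = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * latent_size)
        c = torch.cat([c_n[0], c_n[1]], dim=-1)
        return h, c

# enc = BiSeqEncoderLSTM(5, 100); h, c = enc(torch.randn(1, 14, 5))   # h, c: (1, 200)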

Questions

Okay, question 1: You are saying that for variable x in the time series, I should train the model to learn x[i] - x[i-1] rather than the value of x[i]? Am I interpreting that correctly?

Yes, exactly. Differencing removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little).
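
As a concrete sketch of that transformation, and of how to invert it after reconstruction (arbitrary values; this mirrors what get_data(subtract=True) does above):

import torch

x = torch.tensor([[10., 12., 11., 15.]]).unsqueeze(-1)   # (batch=1, timesteps=4, features=1)

# forward transform: keep the first value, replace the rest with step-to-step differences
diffs = x.clone()
diffs[:, 1:, :] = x[:, 1:, :] - x[:, :-1, :]              # values along time: 10, 2, -1, 4

# inverse transform: a cumulative sum recovers the original sequence
recovered = torch.cumsum(diffs, dim=1)                    # values along time: 10, 12, 11, 15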

Question 2: You said my calculations for zero bottleneck were incorrect. But, for example, let's say I'm using a simple dense network as an autoencoder. Getting the right bottleneck indeed depends on the data. But if you make the bottleneck the same size as the input, you get the identity function.

Yes, assuming there is no non-linearity involved, which makes the thing harder (see here for a similar case). In the case of LSTMs there are non-linearities; that's one point.

Another is that we are accumulating the timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden and cell state, which is highly unlikely.

One last point: depending on the length of the sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), which makes this even more unlikely.

Is num_features * num_timesteps not a bottleneck of the same size as the input, and therefore shouldn't it facilitate the model learning the identity?

It is, but it assumes you have num_timesteps for each data point, which is rarely the case (it might be here). Why the identity is hard for the network to learn in the presence of non-linearities was answered above.

One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed. If it were easy, a network without skip connections could converge to the identity on its own and make "small fixes" to the output, which is not what happens in practice.

I'm curious about the statement: "always use difference of timesteps instead of timesteps". It seems to have some normalizing effect by bringing all the features closer together, but I don't understand why this is key? Having a larger model seemed to be the solution, and the subtraction just helps.

The key here was, indeed, increasing model capacity. The subtraction trick really depends on the data. Let's imagine an extreme situation:

  • We have 100 timesteps, single feature
  • Initial timestep value is 10000
  • Other timestep values vary by 1 at most

What would the neural network do (what is easiest here)? It would probably discard the changes of 1 or less as noise and just predict 10000 for every timestep (especially if some regularization is in place), as being off by 1 in 10000 is not much.

What if we subtract? Then the whole neural-network loss lives in a [0, 1] margin for each timestep instead of on a scale of roughly 10000, hence it is much more severe to be wrong.
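
A back-of-the-envelope illustration of that (my own numbers, following the extreme example above):

import torch

criterion = torch.nn.MSELoss()

# Raw scale: values around 10000, prediction constantly off by 1
target_raw = torch.full((100,), 10000.0)
pred_raw = target_raw + 1.0
print(criterion(pred_raw, target_raw).item())     # 1.0 absolute, ~0.01% relative to the values

# Differenced scale: targets are the per-step changes (at most 1); predict "no change"
target_diff = torch.ones(100)
pred_diff = torch.zeros(100)
print(criterion(pred_diff, target_diff).item())   # 1.0 absolute, but 100% relative to the targets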

And yes, come to think of it, it is connected to normalization in some sense.
