Keras LSTM - feed sequence data with Tensorflow dataset API from the generator


Problem description

I am trying to work out how I can feed data to my LSTM model for training. (I will simplify the problem in my example below.) I have the following data format in the csv files in my dataset.

Timestep    Feature1    Feature2    Feature3    Feature4    Output
1           1           2           3           4           a
2           5           6           7           8           b
3           9           10          11          12          c 
4           13          14          15          16          d
5           17          18          19          20          e
6           21          22          23          24          f
7           25          26          27          28          g
8           29          30          31          32          h
9           33          34          35          36          i
10          37          38          39          40          j

The task is to estimate the Output of any future timestep based on the data from the last 3 timesteps. Some input-output examples are as follows:

Example 1: Input:

Timestep    Feature1    Feature2    Feature3    Feature4    
1           1           2           3           4           
2           5           6           7           8           
3           9           10          11          12           

Output: c

Example 2: Input:

Timestep    Feature1    Feature2    Feature3    Feature4    
2           5           6           7           8           
3           9           10          11          12           
4           13          14          15          16          

Output: d

Example 3: Input:

Timestep    Feature1    Feature2    Feature3    Feature4   
3           9           10          11          12          
4           13          14          15          16         
5           17          18          19          20         

Output: e
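The window construction shown in the examples above can be sketched in plain NumPy (the variable names and the SEQ_LEN constant are illustrative, not from the original post):

```python
import numpy as np

# Toy data matching the table above: 10 timesteps x 4 features,
# plus one output label per timestep.
features = np.arange(1, 41).reshape(10, 4)
outputs = np.array(list("abcdefghij"))

SEQ_LEN = 3  # predict the output of a window's last timestep from 3 timesteps

# All overlapping windows: X has shape (n_windows, SEQ_LEN, n_features).
X = np.stack([features[i:i + SEQ_LEN]
              for i in range(len(features) - SEQ_LEN + 1)])
y = outputs[SEQ_LEN - 1:]  # the label of each window's last timestep

print(X.shape)  # (8, 3, 4)
print(y[0])     # c  (window over timesteps 1-3, as in Example 1)
```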

And when feeding the data to the model, I would like to shuffle it so that I do not feed consecutive sequences during training. In other words, I would ideally like to feed a sequence such as timesteps 3,4,5 in one step, maybe timesteps 5,6,7 in the next step, and maybe 2,3,4 in the following step, and so on. And I would preferably not feed the data as 1,2,3 first, then 2,3,4, then 3,4,5, and so on...

When training my LSTM network, I am using Keras with the Tensorflow backend. I would like to use a generator when feeding my data to the fit_generator(...) function.
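One baseline worth comparing against is a pure-Python/NumPy generator for fit_generator. This is my own sketch (the function name and parameters are hypothetical, and it assumes the features and labels are already loaded as arrays); it yields batches of shape (batch_size, seq_len, n_features) in a window order that is reshuffled every epoch:

```python
import numpy as np

def sequence_generator(features, labels, seq_len=3, batch_size=2, seed=None):
    """Yield (x, y) batches where x has shape (batch_size, seq_len, n_features),
    with the window order reshuffled at the start of every epoch."""
    rng = np.random.RandomState(seed)
    starts = np.arange(len(features) - seq_len + 1)
    while True:  # Keras generators are expected to loop forever
        rng.shuffle(starts)  # non-consecutive window order each epoch
        for i in range(0, len(starts), batch_size):
            idx = starts[i:i + batch_size]
            x = np.stack([features[s:s + seq_len] for s in idx])
            y = np.stack([labels[s + seq_len - 1] for s in idx])
            yield x, y

# Usage with Keras (steps_per_epoch = number of batches in one epoch):
# model.fit_generator(sequence_generator(features, labels, seed=42),
#                     steps_per_epoch=4)
```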

My desire is to use Tensorflow's dataset API to fetch the data from the csv files, but I could not figure out how to make the generator return what I need. If I shuffle the data with Tensorflow's dataset API, it will destroy the order of the timesteps. The generator should also return batches that include multiple sequence examples. For instance, if the batch size is 2, it may need to return 2 sequences, such as timesteps 2,3,4 and timesteps 6,7,8.

Hoping that I could explain my problem... Is it possible to use Tensorflow's dataset API in a generator function for such a sequence problem, so that I can feed batches of sequences as explained above? (The generator needs to return data with the shape [batch_size, length_of_each_sequence, nr_inputs_in_each_timestep], where length_of_each_sequence=3 and nr_of_inputs_in_each_timestep=4 in my example.) Or is the best way to do this to write a generator in Python only, maybe by using Pandas?

Addendum 1:

I have done the following experiment after seeing the answer from @kvish.

import tensorflow as tf
import numpy as np
from tensorflow.contrib.data.python.ops import sliding

sequence = np.array([ [[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]], [[9]] ])
labels = [1,0,1,0,1,0,1,0,1]

# create TensorFlow Dataset object
data = tf.data.Dataset.from_tensor_slices((sequence, labels))

# sliding window batch
window_size = 3
window_shift = 1
data = data.apply(sliding.sliding_window_batch(window_size=window_size, window_shift=window_shift))
data = data.shuffle(1000, reshuffle_each_iteration=False)
data = data.batch(3)

it = tf.data.Iterator.from_structure(data.output_types, data.output_shapes)
el = it.get_next()

# create initialization op (re-run at the start of every epoch)
init_op = it.make_initializer(data)

NR_EPOCHS = 2
with tf.Session() as sess:
    for e in range(NR_EPOCHS):
        print("\nepoch: ", e, "\n")
        sess.run(init_op)
        print("1  ", sess.run(el))
        print("2  ", sess.run(el))
        print("3  ", sess.run(el))

And here is the output:

epoch:  0 

1   (array([[[[6]],[[7]],[[8]]],  [[[1]],[[2]],[[3]]],  [[[2]],[[3]],[[4]]]]), 
     array([[0, 1, 0],  [1, 0, 1],  [0, 1, 0]], dtype=int32))

2   (array([[[[7]],[[8]],[[9]]],  [[[3]],[[4]],[[5]]],  [[[4]],[[5]],[[6]]]]), 
     array([[1, 0, 1],  [1, 0, 1],  [0, 1, 0]], dtype=int32))

3   (array([[[[5]],[[6]],[[7]]]]), array([[1, 0, 1]], dtype=int32))

epoch:  1 

1   (array([[[[2]],[[3]],[[4]]],  [[[7]],[[8]],[[9]]],  [[[1]],[[2]],[[3]]]]), 
     array([[0, 1, 0],  [1, 0, 1],  [1, 0, 1]], dtype=int32))

2   (array([[[[5]],[[6]],[[7]]],  [[[3]],[[4]],[[5]]],  [[[4]],[[5]],[[6]]]]), 
     array([[1, 0, 1],  [1, 0, 1],  [0, 1, 0]], dtype=int32))

3   (array([[[[6]],[[7]],[[8]]]]), 
     array([[0, 1, 0]], dtype=int32))

I could not try it on csv file reading yet, but I think this approach should work quite well!

But as far as I can see, the reshuffle_each_iteration parameter makes no difference. Is it really needed? The results are not necessarily identical whether it is set to True or False. What is this reshuffle_each_iteration parameter supposed to do here?

Recommended answer

I think this answer might be close to what you are looking for!

You create batches by sliding windows over the data, and then shuffle the input in your case. The shuffle function of the dataset API has a reshuffle_each_iteration parameter, which you will probably want to set to False if you want to experiment with setting a random seed and inspecting the order of the shuffled outputs.
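The intended semantics of that flag can be mimicked in plain Python (a conceptual sketch, not tf.data itself): with reshuffling enabled, each epoch draws a fresh order; with it disabled, the first shuffled order is replayed every epoch.

```python
import random

data = list(range(8))

def epoch_orders(reshuffle, n_epochs=2, seed=42):
    """Mimic the two shuffle modes: reshuffle=True draws a fresh order per
    epoch; reshuffle=False replays the first shuffled order every epoch."""
    rng = random.Random(seed)
    fixed = None
    orders = []
    for _ in range(n_epochs):
        if reshuffle or fixed is None:
            fixed = data[:]
            rng.shuffle(fixed)
        orders.append(list(fixed))
    return orders

a, b = epoch_orders(reshuffle=False)
print(a == b)  # True: without reshuffling, every epoch sees the same order
```

One possible reason the flag showed no visible effect in the experiment above is that re-running init_op rebuilds the input pipeline from scratch each epoch; reshuffle_each_iteration mainly matters when the dataset is repeated with .repeat() inside a single pipeline.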
