Creating a TimeseriesGenerator with multiple inputs


Problem description

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks. Due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.

This leads me to using a generator instead, like the TimeseriesGenerator from Keras/TensorFlow. The problem is that if I use the generator on all of my data stacked together, it creates sequences of mixed stocks. See the example below with a sequence length of 5, where Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
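The boundary-crossing problem can be reproduced with plain NumPy. This is a minimal sketch (the toy values and the `windows` list comprehension are illustrative, not from the question) of the sliding windows a length-5 TimeseriesGenerator would produce over two stacked stocks:

```python
import numpy as np

# two toy "stocks" stacked into one array, as TimeseriesGenerator would see them
stock1 = np.arange(6)            # observations of stock 1: 0..5
stock2 = np.arange(100, 106)     # observations of stock 2: 100..105
stacked = np.concatenate([stock1, stock2])

length = 5  # sequence length, as in the question
windows = [stacked[i:i + length] for i in range(len(stacked) - length)]

print(windows[2])  # [  2   3   4   5 100] -- last 4 obs of stock 1, first of stock 2
```

The third window (Sequence 3) silently mixes the two stocks, which is exactly the sequence the model should never see.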

Instead, what I would want is something like this:

Somewhat similar question: Merging or appending multiple Keras TimeseriesGenerator objects

I explored the option of combining the generators as this SO answer suggests: How do I combine two keras generator functions. However, this is not ideal in the case of ~4000 generators.

I hope my question makes sense.

Answer

So what I ended up doing is all the preprocessing manually, saving an .npy file for each stock containing the preprocessed sequences. Then, using a manually created generator, I make batches like this:

import numpy as np
import tensorflow as tf

class seq_generator():

  def __init__(self, list_of_filepaths):
    # remember which sequence indices have already been yielded per file
    self.usedDict = dict()
    for path in list_of_filepaths:
      self.usedDict[path] = []

  def generate(self):
    while True:
      # pick a random stock file, then a random sequence index within it
      path = np.random.choice(list(self.usedDict.keys()))
      stock_array = np.load(path)
      random_sequence = np.random.randint(stock_array.shape[0])
      # yield each sequence at most once per file
      if random_sequence not in self.usedDict[path]:
        self.usedDict[path].append(random_sequence)
        yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# pass the bound generator method itself (not a call on the class), and
# declare a single float32 output of shape (n_timesteps, n_features)
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))

train_dataset = train_dataset.batch(batch_size)

Where list_of_filepaths is simply a list of paths to the preprocessed .npy data.
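The answer does not show the preprocessing step that produces those files. A minimal sketch, assuming each stock's raw data is a (n_days, n_features) array and each saved file holds an array of shape (n_sequences, n_timesteps, n_features); the function name and toy data below are hypothetical:

```python
import numpy as np
import os
import tempfile

def save_stock_sequences(raw, n_timesteps, out_path):
    """Slide a window of n_timesteps over one stock's (n_days, n_features)
    array and save the resulting (n_sequences, n_timesteps, n_features) file."""
    seqs = np.stack([raw[i:i + n_timesteps]
                     for i in range(raw.shape[0] - n_timesteps + 1)])
    np.save(out_path, seqs)
    return out_path

# toy usage: one "stock" with 10 days of 3 features, sequence length 5
out_dir = tempfile.mkdtemp()
path = save_stock_sequences(np.random.rand(10, 3), n_timesteps=5,
                            out_path=os.path.join(out_dir, "stock_0.npy"))
print(np.load(path).shape)  # (6, 5, 3)
```

Since windows never cross a file boundary, every sequence the generator later yields belongs to a single stock.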

This will draw random sequences from random stocks, so batches mix stocks while each individual sequence still comes from a single stock, and each sequence is yielded at most once per file.
