Creating a TimeseriesGenerator with multiple inputs
Question
I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks. Due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.
This led me to using a generator instead, like the TimeseriesGenerator from Keras/TensorFlow. The problem is that if I try using the generator on all of my data stacked together, it creates sequences of mixed stocks. See the example below with a sequence length of 5: here, sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
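To make the boundary problem concrete, here is a minimal NumPy sketch (the stock values and lengths are made up) of the sliding windows a single generator would produce over two stacked stocks:

```python
import numpy as np

# Two hypothetical stocks with 7 daily observations each (1 feature)
stock1 = np.arange(1, 8, dtype=float).reshape(-1, 1)      # values 1..7
stock2 = np.arange(101, 108, dtype=float).reshape(-1, 1)  # values 101..107
stacked = np.vstack([stock1, stock2])                     # shape (14, 1)

seq_len = 5
# Sliding windows over the stacked array, as one TimeseriesGenerator would see it
windows = [stacked[i:i + seq_len] for i in range(len(stacked) - seq_len + 1)]

# Window 3 crosses the stock boundary: the last 4 rows of stock1
# followed by the first row of stock2
print(windows[3].ravel().tolist())  # [4.0, 5.0, 6.0, 7.0, 101.0]
```

Any window whose start index falls within `seq_len - 1` rows of a stock boundary mixes the two stocks in this way.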
Instead, what I would want is something like this:
Somewhat similar question: Merge or append multiple Keras TimeseriesGenerator objects
I explored the option of combining generators as this SO answer suggests: How do I combine two keras generator functions. However, this is not ideal in the case of ~4000 generators.
I hope my question makes sense.
Answer
So what I ended up doing is all the preprocessing manually, saving an .npy file for each stock containing its preprocessed sequences. Then, using a manually created generator, I make batches like this:
import numpy as np
import tensorflow as tf

class seq_generator():
    def __init__(self, list_of_filepaths):
        # Track which sequence indices have already been used per stock file
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            # Pick a random stock file and load its preprocessed sequences
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            # Pick a random sequence index from that stock
            random_sequence = np.random.randint(stock_array.shape[0])
            if random_sequence not in self.usedDict[path]:
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# Pass the bound method (a callable), not a call on the class
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to the preprocessed .npy data.
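The per-stock .npy files can be produced by windowing each stock separately before training; a hypothetical sketch (the function name, shapes, and file name below are my own, not from the answer):

```python
import numpy as np

def save_stock_sequences(stock_data, seq_len, out_path):
    """Window one stock's (n_days, n_features) array into overlapping
    sequences of shape (n_sequences, seq_len, n_features) and save them."""
    sequences = np.stack([stock_data[i:i + seq_len]
                          for i in range(len(stock_data) - seq_len + 1)])
    np.save(out_path, sequences)
    return sequences.shape

# e.g. 30 days x 4 features for one stock -> 26 sequences of length 5
shape = save_stock_sequences(np.random.rand(30, 4), 5, "stock_0001.npy")
print(shape)  # (26, 5, 4)
```

Because each stock is windowed on its own, no sequence can span two stocks.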
This will:

- Load a random stock's preprocessed .npy data
- Pick a sequence at random
- Check if the index of the sequence has already been used in usedDict
- If not:
  - Append the index of that sequence to usedDict to keep track, so the same data is not fed to the model twice
  - Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods of TensorFlow's Dataset type.