Merge or append multiple Keras TimeseriesGenerator objects into one
Question
I'm trying to make an LSTM model. The data comes from a csv file that contains values for multiple stocks.
I can't use all the rows as they appear in the file to make sequences because each sequence is only relevant in the context of its own stock, so I need to select the rows for each stock and make the sequences based on that.
I have something like this:
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np

for stock in stocks:
    stock_df = df.loc[df['symbol'] == stock].copy()
    target = stock_df.pop('price')
    x = np.array(stock_df.values)
    y = np.array(target.values)
    sequence = TimeseriesGenerator(x, y, length=4, sampling_rate=1, batch_size=1)
That works fine, but then I want to merge each of those sequences into a bigger one that I will use for training and that contains the data for all the stocks.
It is not possible to use append or merge because the function returns a generator object, not a numpy array.
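One workaround (my own sketch, not from the question) is to merge by index rather than by appending arrays: wrap the per-stock generators in an object that exposes a combined length and integer indexing, which TimeseriesGenerator supports. In practice such a wrapper would subclass tf.keras.utils.Sequence so model.fit accepts it; the duck-typed version below shows just the index arithmetic, with plain lists standing in for the hypothetical per-stock generators:

```python
class MergedGenerators:
    """Present several batch sources as one indexable sequence.

    Each wrapped object only needs len() and integer indexing, which
    TimeseriesGenerator provides. To pass the result to model.fit,
    subclass tf.keras.utils.Sequence instead of a plain class.
    """
    def __init__(self, *generators):
        self.generators = generators

    def __len__(self):
        return sum(len(g) for g in self.generators)

    def __getitem__(self, index):
        # Walk the generators, translating the global batch index
        # into a local index within the right generator.
        for g in self.generators:
            if index < len(g):
                return g[index]
            index -= len(g)
        raise IndexError(index)


# Plain lists stand in for per-stock TimeseriesGenerator objects.
merged = MergedGenerators(["a0", "a1"], ["b0"])
```

Here len(merged) is 3, and merged[2] resolves to "b0", the first batch of the second source.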
Answer
New answer:

So what I've ended up doing is all the preprocessing manually, saving an .npy file for each stock containing the preprocessed sequences; then, using a manually created generator, I make batches like this:
import numpy as np
import tensorflow as tf

class seq_generator():
    def __init__(self, list_of_filepaths):
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            # Pick a random stock file and a random sequence within it.
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            random_sequence = np.random.randint(stock_array.shape[0])
            if random_sequence not in self.usedDict[path]:
                # Remember the index so the same sequence is never fed twice.
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to the preprocessed .npy data.
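Those .npy files might be produced in a preprocessing pass like the sketch below. The make_windows helper, the file naming, and the random placeholder data are my assumptions, not from the original answer; each saved array has shape (samples, n_timesteps, n_features), matching what the generator above yields per sequence:

```python
import os
import tempfile
import numpy as np

def make_windows(x, length=4):
    """Stack rolling windows of x into shape (samples, length, features)."""
    return np.stack([x[i - length:i] for i in range(length, len(x))])

# Random placeholder data: {stock symbol: (rows, features) array}.
raw = {"AAPL": np.random.rand(10, 3), "MSFT": np.random.rand(8, 3)}

outdir = tempfile.mkdtemp()
list_of_filepaths = []
for stock, x in raw.items():
    seqs = make_windows(x)                      # e.g. (6, 4, 3) for 10 rows
    path = os.path.join(outdir, stock + ".npy")
    np.save(path, seqs)                         # one file per stock
    list_of_filepaths.append(path)
```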
This will:

- Load a random stock's preprocessed .npy data
- Pick a sequence at random
- Check if the index of the sequence has already been used in usedDict
- If not:
  - Append the index of that sequence to usedDict to keep track, so as not to feed the same data twice to the model
  - Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods from TensorFlow's Dataset type.

Old answer:
I think the answer from @TF_Support slightly misses the point. If I understand your question, it's not that you want to train one model per stock; you want one model trained on the entire dataset.
If you have enough memory you could manually create the sequences and hold the entire dataset in memory. The issue I'm facing is similar; I simply can't hold everything in memory: Creating a TimeseriesGenerator with multiple inputs.
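Assuming memory is not a constraint, the per-stock sequences could be built by hand and then concatenated into single training arrays, since plain numpy arrays (unlike generator objects) do support concatenation. In this sketch, make_sequences mirrors what TimeseriesGenerator produces with length=4 and sampling_rate=1 (window x[i-4:i] paired with target y[i]), and the random arrays are placeholders:

```python
import numpy as np

def make_sequences(x, y, length=4):
    """Build (samples, length, features) windows with aligned targets,
    mirroring TimeseriesGenerator with sampling_rate=1."""
    xs, ys = [], []
    for i in range(length, len(x)):
        xs.append(x[i - length:i])
        ys.append(y[i])
    return np.array(xs), np.array(ys)

# Placeholder per-stock data: 3 features, different row counts.
stock_a_x, stock_a_y = np.random.rand(10, 3), np.random.rand(10)
stock_b_x, stock_b_y = np.random.rand(8, 3), np.random.rand(8)

xa, ya = make_sequences(stock_a_x, stock_a_y)   # 6 windows from 10 rows
xb, yb = make_sequences(stock_b_x, stock_b_y)   # 4 windows from 8 rows

# Plain arrays, so merging is just concatenation along the sample axis.
x_all = np.concatenate([xa, xb])
y_all = np.concatenate([ya, yb])
```

Because each window is built within a single stock before concatenation, no sequence ever mixes rows from two stocks.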
Instead, I'm exploring the possibility of preprocessing all the data for each stock separately, saving them as .npy files, and then using a generator to load a random sample of those .npy files to batch data to the model; I'm not entirely sure how to approach this yet, though.