How to Train LSTM Using Multiple Datasets?


Question

Try as I might, I have yet to find an answer for this question.

I simply want to train an LSTM network using Python 3.6 and TensorFlow on multiple .csv files/datasets, for example historical stock data for multiple companies.

The reason for this is that I want to fit the model on a wide variety of price ranges rather than train individual models on every dataset. How would I go about doing this?

I can't just append one dataset to another to create one big dataset, because during the train/test split, prices may jump from $2 to $200 depending on the stock and where the datasets are stitched together.

What is the best practice for doing something like this?

  1. Just create a loop over every .csv file and call the .fit function to train on each file one after another (updating the weights as it goes) for a certain number of epochs, using early stopping once the optimal loss is found? (Which I understand how to do now.)

  2. Is there a way to create a generator that could somehow yield a different x_train and y_train tuple from each .csv, fit the model with each tuple, and then have a training checkpoint after one tuple has been sampled from each .csv file? My thinking here is that the model should have a chance to sample a piece from each dataset before completing an epoch.

Example: let's say I want to use a 20-period lookback/window size to predict t+1, and I have 5 .csv files to train with. The generator would (ideally) load all datasets into memory, then pluck a random sample of 20 rows from the first .csv file, fit the model on it, pluck another 20 rows from the second .csv, fit on those, and so on; once all 5 have been sampled, checkpoint to assess the loss, then move on to the next epoch and do it all over again. (A rough sketch of such a generator follows.)
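For reference, option 2 could look roughly like the sketch below. This is only an illustration of the idea described above: the file names, feature columns, and t+1 target are placeholder assumptions, not anything from the original post. Keras' fit accepts a Python generator directly in TensorFlow 2, and setting steps_per_epoch to the number of files gives one window per file per epoch.

import numpy as np
import pandas as pd

LOOKBACK = 20 # window size from the example above

# Load every dataset into memory up front (file names are placeholders)
frames = [pd.read_csv(f) for f in ['stock1.csv', 'stock2.csv', 'stock3.csv',
                                   'stock4.csv', 'stock5.csv']]

def window_generator(frames, lookback=LOOKBACK):
    # Endlessly yield one random (x, y) window from each dataset in turn,
    # so every pass touches all files before the next pass begins
    while True:
        for df in frames:
            start = np.random.randint(0, len(df) - lookback)
            window = df.iloc[start:start + lookback]
            x = window[['Open', 'High', 'Low', 'Close']].to_numpy() # assumed feature columns
            y = df['Close'].iloc[start + lookback]                  # assumed t+1 target
            yield x[np.newaxis, :, :], np.array([y])                # batch dimension of 1

# e.g.: model.fit(window_generator(frames), steps_per_epoch=len(frames), epochs=100)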

This might be overkill, but I wanted to be thorough. And if option 1 (sketched below) would accomplish the same thing, that's fine with me too; I just haven't come across an answer yet.
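For what it's worth, option 1 could be sketched roughly as follows, assuming a compiled model already exists and using a hypothetical build_xy() helper that windows and scales a single file (both the helper and the file names are placeholders, not from the original post):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# File names are placeholders; the model's weights carry over between files
for path in ['stock1.csv', 'stock2.csv', 'stock3.csv']:
    x, y = build_xy(path) # hypothetical helper: window + scale one file
    model.fit(x, y, validation_split=0.2, epochs=50,
              callbacks=[early_stop])

One caveat with this sequential scheme is that later files can dominate the final weights, which is part of what option 2's interleaved sampling tries to avoid.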

Thanks!

UPDATE

Since I asked this question, one of the ways I crafted my solution (for my particular application) was using the code below. Basically, if I pulled the last 5 years of stock price data for several different stocks, I would append one dataset on top of the other into one big dataset, then iterate through all the rows after assigning a "look back" period, i.e. how many days the LSTM should look back in its features. It then looks at the date column, and as long as the dates for each group of 10 features are in ascending order, it bunches those features together for the LSTM. But if the date jumps from 2020-09-01 to 2015-09-01, that means that part of the dataset is where a new stock's data starts, so it just continues on down through the file until it finds 10 rows pertaining to one stock. This makes sure that each 3D group of features for the LSTM comes from only one particular stock.

Hope that makes some kind of sense. I commented the function fairly thoroughly, so it should be easy to see how it works; I then defined a GRU model to show how it would be put into practice from there:

import numpy as np
import pandas as pd

# A function to get a set of X's and Y's for training an LSTM,
# so long as the dates are in ascending order, so you're not
# stitching together X features from two different datasets

def create_batched_dataset(x, y, time_steps=1): # time_steps=10 in the calls below

    x = x.reset_index() # Reset the index column so we can compare
                        # the dates to one another
    x['Date'] = pd.to_datetime(x['Audit_Date']) # Make the dates a datetime object
                                                # ('Audit_Date' is the raw date column here)

    xs, ys = [], [] # Lists for our features/labels for the LSTM

    for i in range(len(x) - time_steps): # Range 0 to 430 in my dataset

        v = x.iloc[i:(i + time_steps), :] # v = the next 10 rows of the X set

        if v['Date'].iloc[-1] <= v['Date'].iloc[0]: # Only batch from one training dataset,
                                                    # not where two datasets stitch together:
                                                    # the last date of the 10 rows must come
                                                    # after the first date
            continue

        v = v.set_index(['Date']) # Set the index again

        xs.append(v.iloc[:, :-1].to_numpy()) # Append those 10 rows to the Xs list,
                                             # without the target label column

        ys.append(y.iloc[i + time_steps]) # Append the corresponding label to the Ys list

    return np.array(xs), np.array(ys)


# Get our reshaped features/labels (to [samples, time_steps, n_features])
x_train, y_train = create_batched_dataset(train_scaled, train_scaled.iloc[:,-1], 10)
x_test, y_test = create_batched_dataset(test_scaled, test_scaled.iloc[:,-1], 10)


# Define some type of LSTM model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(GRU(11, input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(Dense(11, activation="relu"))
model.add(Dense(1))
model.compile(loss='mae', optimizer=Adam(learning_rate=0.001))
print(model.summary())
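From there, a training call might look like the following; the epoch count, batch size, and validation setup are illustrative placeholders, not values from the original post:

# Hypothetical training call; hyperparameters are placeholders
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          epochs=100,    # illustrative
          batch_size=32, # illustrative
          shuffle=True)  # safe here: each 3D sample already comes from a single stock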

UPDATE 2: Here is another solution using lists. Basically, for every ticker we have a dataset for, import the df and add its stock price data to individual lists, then add those lists to one master list. Then, when you're ready to train, randomly pull one list of stock prices from the master list to feed into your NN. Note, you'll have to define open = prices[0], high = prices[1], etc. inside your NN function. Hope that helps:

import random
import pandas as pd

# list_of_tickers, interval, agent, iterations and initial_money
# are assumed to be defined elsewhere in the script

prices_library = []
for ticker in list_of_tickers: # Used for multiple tickers
    print(ticker)

    df = pd.read_csv('./' + ticker + '_' + interval + 'm.csv')

    date = df['Date'].values.tolist() # Assumes the CSV has a 'Date' column
    open = df['Open'].values.tolist() # Note: shadows the built-in open()
    high = df['High'].values.tolist()
    low = df['Low'].values.tolist()
    close = df['Close'].values.tolist()
    volume = df['Volume'].values.tolist()

    prices_library.append([date,
                           open,
                           high,
                           low,
                           close,
                           volume])

for i in range(len(prices_library) * iterations):
    print('Iteration: ' + str(i+1) + ' of ' + str(len(prices_library) * iterations))
    agent.train(iterations=1, checkpoint=1, initial_money=initial_money,
                prices=random.choice(prices_library)) # Pick one ticker's data at random

Answer

Merge all of the CSVs together into one file and give training enough steps so that it covers all of them. If you preprocess, you should create sequences in one training file, one row per sequence, where each sequence has the 20 or so previous periods for a given CSV. That way, when they are fed randomly into the model, each sequence corresponds to the correct stock.
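A minimal sketch of that preprocessing, assuming each CSV has Open/High/Low/Close/Volume columns and that the t+1 close is the target (both assumptions are mine, not the answerer's): each file is windowed separately, so no sequence ever spans two stocks, and only then are the finished sequences merged and shuffled.

import glob
import numpy as np
import pandas as pd

WINDOW = 20 # lookback length, as in the question

def windows_from_csv(path, window=WINDOW):
    # Build (window, n_features) samples and t+1 targets from one CSV,
    # so no sequence ever spans two different stocks
    df = pd.read_csv(path)
    features = df[['Open', 'High', 'Low', 'Close', 'Volume']].to_numpy() # assumed columns
    targets = df['Close'].to_numpy()                                     # assumed target
    xs, ys = [], []
    for i in range(len(df) - window):
        xs.append(features[i:i + window])
        ys.append(targets[i + window])
    return np.array(xs), np.array(ys)

# Window each file separately, then merge and shuffle the finished sequences
all_x, all_y = [], []
for path in glob.glob('./*.csv'):
    x, y = windows_from_csv(path)
    all_x.append(x)
    all_y.append(y)

x_train = np.concatenate(all_x)
y_train = np.concatenate(all_y)

shuffle_idx = np.random.permutation(len(x_train)) # random order mixes the stocks
x_train, y_train = x_train[shuffle_idx], y_train[shuffle_idx]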
