Choose random validation data set


Question

I have a NumPy array of data that a simulation has generated over ongoing time. I'm using TensorFlow and Keras to train a neural network on it, and my question refers to this line of code in my model:

model.fit(X1, Y1, epochs=1000, batch_size=100, verbose=1, shuffle=True, validation_split=0.2)

After reading the Keras documentation, I found out that the validation data set (in this case 20% of the original data) is sliced from the end. As I'm generating data over ongoing time, I obviously don't want the last part to be sliced off, because it would not be representative for validation. I'd rather have the validation data chosen randomly from the whole data set. For this purpose I am currently shuffling my whole data set (inputs and outputs for the ANN in unison) before training to get random validation data.

I feel like I don't want to ruin the time component in my data, which is why I'm searching for a way to choose the validation set randomly without having to shuffle the whole data set. Also, I'd like to know what you think of not shuffling time-continuous data. Again, I'm not asking about the nature of the validation split; I just want to know how to change the way the validation data is selected.

Recommended answer

As you mentioned, Keras simply takes the last x samples of the dataset as the validation set, so if you want to keep using validation_split, you need to shuffle your dataset in advance.
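If shuffling the whole data set is undesirable, one possible alternative is to draw the validation rows at random yourself and hand them to Keras through the validation_data argument of model.fit instead of validation_split. The following is only a minimal sketch under assumptions: the helper name random_validation_split, the seed, and the toy arrays standing in for the simulation data are all made up for illustration.

```python
import numpy as np

def random_validation_split(X, Y, validation_fraction=0.2, seed=0):
    """Hypothetical helper: pick validation rows at random while
    keeping both subsets in their original (time) order."""
    n_samples = len(X)
    n_valid = int(round(n_samples * validation_fraction))
    rng = np.random.default_rng(seed)
    # Sample row indices without replacement, then sort them so the
    # validation rows stay in chronological order.
    valid_idx = np.sort(rng.choice(n_samples, size=n_valid, replace=False))
    # Everything that was not picked for validation is training data.
    train_mask = np.ones(n_samples, dtype=bool)
    train_mask[valid_idx] = False
    return X[train_mask], X[valid_idx], Y[train_mask], Y[valid_idx]

# Toy stand-in for the simulation data: 10 time steps, 2 features each.
X1 = np.arange(20).reshape(10, 2)
Y1 = np.arange(10)

X_train, X_valid, Y_train, Y_valid = random_validation_split(X1, Y1)

# The subsets would then be passed to Keras explicitly, for example:
# model.fit(X_train, Y_train, epochs=1000, batch_size=100, verbose=1,
#           shuffle=True, validation_data=(X_valid, Y_valid))
```

With this approach the training rows are never reordered on disk or in memory. Whether batches are shuffled during training is still controlled separately by the shuffle argument of model.fit.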

Or, you can simply use the sklearn train_test_split() function:

x_train, x_valid, y_train, y_valid = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

This function has an argument named "shuffle" which determines whether to shuffle the data prior to the split (it is set to True by default).
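To make the effect of "shuffle" concrete, here is a small sketch on toy arrays (the data and the variable names ending in _ns are assumptions for illustration only). It contrasts the default behaviour with shuffle=False, which reproduces the "slice from the end" split that Keras performs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 10 samples, 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Default shuffle=True: validation rows are drawn at random from the
# whole data set; random_state makes the draw repeatable.
x_train, x_valid, y_train, y_valid = train_test_split(
    x, y, test_size=0.2, random_state=0)

# shuffle=False: nothing is shuffled, so the last 20% of the rows
# become the validation set.
x_train_ns, x_valid_ns, y_train_ns, y_valid_ns = train_test_split(
    x, y, test_size=0.2, shuffle=False)
```

Note that with shuffle=True the returned training and validation arrays come back in permuted order, so if the chronological order matters downstream the rows would need to be re-sorted afterwards.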

However, if the targets are class labels, a better split of the data can be obtained with the "stratify" argument, which provides a similar distribution of labels among the validation and training sets:

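from sklearn.model_selection import train_test_split

# stratify=y keeps the label proportions similar in both subsets
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=y)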
