Splitting data into training, testing and validation sets when making a Keras model

Question

I'm a little confused about splitting the dataset when making and evaluating Keras machine learning models. Let's say that I have a dataset of 1000 rows.

features = df.iloc[:,:-1]
results = df.iloc[:,-1]

Now I want to split this data into training and testing (33% of data for testing, 67% for training):

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

I have read on the internet that fitting the data into model should look like this:

history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)

So I'm fitting the full data (features and results) to my model, and using 20% of that data for validation: validation_split = 0.2. So basically, my model will be trained on 80% of the data and tested on the other 20%.

The confusion starts when I need to evaluate the model:

score = model.evaluate(X_test, y_test, batch_size=50)

Is this correct? I mean, why should I split the data into training and testing at all, and where do X_train and y_train go?

Can you please explain what the correct order of steps is for creating a model?

Solution

Generally, at training time (model.fit) you work with two sets: a training set and a validation set (also called a tuning or development set). You train the model on the training set and use the validation set to find the best set of hyper-parameters. When you are done, you can test the model on an unseen data set, one that was completely hidden from the model, unlike the training and validation sets.
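The three roles can be sketched with plain Python. This is only an illustration of the bookkeeping: `three_way_split` is a hypothetical helper, the 33% and 20% fractions are the ones from the question, and the rounding and shuffling are assumptions that do not reproduce the internals of scikit-learn or Keras.

```python
import random


def three_way_split(n_rows, test_frac=0.33, val_frac=0.2, seed=0):
    """Partition row indices into disjoint train/validation/test lists.

    Hypothetical helper for illustration only; it does not reproduce the
    exact rounding or shuffling used by scikit-learn or Keras.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Rows held back for the final test; the model never sees them in fit().
    n_test = round(n_rows * test_frac)
    test_idx = indices[:n_test]
    remaining = indices[n_test:]

    # Rows used during training to monitor and tune hyper-parameters.
    n_val = round(len(remaining) * val_frac)
    val_idx = remaining[:n_val]

    # Everything else is used to update the weights.
    train_idx = remaining[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 536 134 330
```

With 1000 rows this leaves 536 rows for training, 134 for validation and 330 for the final test, and no row appears in more than one set.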


Now, when you call

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

you split the features and results into 33% for testing and 67% for training. You can then do one of two things with the test portion:

  1. use X_test and y_test as the validation set in model.fit(...), or
  2. use them for the final prediction in model.predict(...)


If you choose to use the test sets as a validation set (option 1), you would do as follows:

model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)

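In the training log, you will get the validation results alongside the training scores. If you later call model.evaluate(X_test, y_test) on the trained model, it should report the same values as the validation results of the final epoch. Note that with this option no unseen data is left over for a final test.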


If you instead keep the test sets for the final prediction or final evaluation (option 2), you need either to create a separate validation set or to use the validation_split argument as follows:

model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)
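To make the effect of validation_split concrete: the Keras documentation states that the validation samples are taken from the end of the x and y arrays, before any shuffling. The sketch below imitates that behaviour on plain Python lists. `tail_validation_split` is a hypothetical helper written for this illustration, not part of Keras, and its rounding is an assumption.

```python
import math


def tail_validation_split(x, y, validation_split):
    """Hold out the last fraction of (x, y) as validation data.

    Hypothetical helper that imitates the documented behaviour of the
    validation_split argument; it is not Keras's actual implementation.
    """
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")

    # Index of the first validation sample; everything before it is trained on.
    split_at = math.floor(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


x = list(range(10))
y = [value % 2 for value in x]
(train_x, train_y), (val_x, val_y) = tail_validation_split(x, y, 0.2)
print(train_x)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_x)    # [8, 9]
```

Because the held-out rows are simply the last ones, the order of the input matters. train_test_split shuffles by default, so X_train and y_train coming out of it are already in random order.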

Keras will hold out 20% of the training data (X_train and y_train) and use it for validation. Lastly, for the final evaluation of your model, you can do as follows:

y_pred = model.predict(X_test, batch_size=50)

Now you can compare y_test and y_pred using whichever metrics are relevant to your problem.
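As a rough sketch of that comparison, the snippet below computes two common metrics by hand. The label and prediction values are made up for illustration, the predictions are assumed to be flattened to one number per sample, and the 0.5 threshold assumes a binary classifier with a sigmoid output. For real work, scikit-learn's sklearn.metrics module offers equivalents such as accuracy_score and mean_absolute_error.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predicted_labels = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, predicted_labels) if t == p)
    return correct / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, a typical regression metric."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values standing in for y_test and model.predict(X_test).
y_test = [0, 1, 1, 0, 1]
y_pred = [0.1, 0.8, 0.4, 0.3, 0.9]

print(accuracy(y_test, y_pred))             # 0.8
print(mean_absolute_error(y_test, y_pred))  # approximately 0.26
```

Which metric is appropriate depends on the task: accuracy (or precision and recall) for classification, an error measure such as mean absolute error for regression.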
