How training and test data is split - Keras on Tensorflow


Problem description

I am currently training my data using a neural network and the fit function:

history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

Now I have set validation_split to 20%. My understanding is that my training data will be 80% and my testing data will be 20%. I am confused about how this data is handled on the back end. Is it that the top 80% of the samples are taken for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to supply separate training and testing data, how do I do that with fit()?

Moreover, my second concern is how to check whether the data fits the model well. I can see from the results that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean it is a case of over-fitting or under-fitting?

My last question is: what does evaluate return? The documentation says it returns the loss, but I am already getting the loss and accuracy during each epoch (as the return value of fit(), in history). What do the accuracy and score returned by evaluate show? If the accuracy returned by evaluate is 90%, can I say my data fits well, regardless of what the individual accuracy and loss were for each epoch?

Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))

dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y) 
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))  # input_dim must match the number of features in X (50 here)
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))

    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi-class
    return model


model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


pre_cls=model.predict_classes(X)    
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix :')
print(cm1)


score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

Recommended answer

1. The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means that the shuffle occurs after the split, so the validation set is always taken from the end of the data in the order you provided it. There is also a boolean parameter called "shuffle" that is set to True by default, so if you don't want your data to be shuffled you can set it to False.
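To make the behavior concrete, here is a minimal sketch (reusing X, encoded_Y and model from the question's code; the 0.2 fraction mirrors the question's validation_split) of what the split amounts to, and of how you can pass your own validation set to fit() via the validation_data argument instead:

# validation_split=0.2 is equivalent to slicing off the last 20%
# of the rows, in the order you provided them (no shuffling first).
split_at = int(len(X) * (1 - 0.2))
X_train, X_val = X[:split_at], X[split_at:]
y_train, y_val = encoded_Y[:split_at], encoded_Y[split_at:]

# To supply separate training and validation data explicitly,
# pass validation_data; it takes precedence over validation_split.
history = model.fit(X_train, y_train, batch_size=50, epochs=500,
                    validation_data=(X_val, y_val), verbose=1)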

2. Getting good results on your training data and then bad or not-so-good results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns a very specific scenario and can't achieve good results on new data.
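If you see that gap (90% training accuracy vs. 55% validation accuracy, as in the question), one common remedy in Keras is early stopping on the validation loss; a minimal sketch (restore_best_weights assumes a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 20 epochs and roll back
# to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)

history = model.fit(X, encoded_Y, batch_size=50, epochs=500,
                    validation_split=0.2, verbose=1,
                    callbacks=[early_stop])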

3. Evaluation is testing your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want to create a third group of data, because if you keep adjusting your model to get better and better results on your test data, that is in some way like cheating: you are telling your model what the evaluation data looks like, and this can cause overfitting.
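A three-way split is easy to build by calling train_test_split twice; a minimal sketch (the 60/20/20 ratio is just an illustrative choice):

from sklearn.model_selection import train_test_split

# First hold out 20% as the final test set...
X_tmp, X_test, y_tmp, y_test = train_test_split(X, encoded_Y, test_size=0.2)
# ...then split the remainder: 25% of the remaining 80% = 20% validation,
# leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25)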

Also, if you want to split your data without using Keras, I recommend using sklearn's train_test_split() function.

It's simple to use and it looks like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
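Tying this back to the question's last point: train on the training portion and keep the test portion for evaluate(), which runs a single pass over data the model never saw during training and returns the loss plus every metric the model was compiled with (here, accuracy). A minimal sketch, assuming y is the one-hot encoded_Y from the question:

history = model.fit(X_train, y_train, batch_size=50, epochs=500,
                    validation_split=0.2, verbose=1)

# evaluate() returns [loss, accuracy] because the model was compiled
# with metrics=['accuracy']; measured on held-out data, this is the
# number that tells you how well the model generalizes.
score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)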

