How training and test data is split - Keras on Tensorflow

Question

I am currently training a neural network on my data using the fit function:

history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

I have set validation_split to 0.2 (20%). As I understand it, 80% of my data will be used for training and 20% for testing. I am confused about how this split is handled internally: are the first 80% of the samples used for training and the last 20% for testing, or are the samples picked at random? And if I want to provide separate training and testing data, how do I do that with fit()?

My second concern is how to check whether the model fits the data well. The results show a training accuracy of around 90% and a validation accuracy of around 55%. Does this indicate overfitting or underfitting?

My last question is what evaluate() returns. The documentation says it returns the loss, but I already get the loss and accuracy for each epoch from fit() (in history). What do the score and accuracy returned by evaluate() represent? If evaluate() reports 90% accuracy, can I say the model fits my data well, regardless of the per-epoch accuracy and loss?

Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))

dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y) 
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))  # input_dim must match the 50 feature columns of X
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))

    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi class
    return model


model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


pre_cls = numpy.argmax(model.predict(X), axis=1)  # predict_classes() is no longer available in recent Keras releases
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)


score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

Recommended answer

  1. The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means the split happens first and the shuffling happens afterwards, so the last 20% of your samples, in their original order, become the validation set. There is also a boolean parameter called shuffle, which defaults to True and applies to the training data; if you don't want your training data to be shuffled, set it to False.
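
The behaviour quoted above can be illustrated without Keras. The following is a minimal NumPy sketch, assuming the split point is the floor of n * (1 - validation_split); the helper name tail_split is hypothetical and not part of the Keras API. It shows that when the samples are sorted by class, the tail slice can contain a single class, and that shuffling the arrays yourself before calling fit() avoids this.

```python
import numpy as np

def tail_split(x, y, validation_split=0.2):
    # Hypothetical helper: take the LAST fraction of samples, unshuffled,
    # which is the behaviour the quoted documentation describes.
    split_at = int(np.floor(len(x) * (1.0 - validation_split)))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Toy data sorted by label: eight samples of class 0, then two of class 1.
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.array([0] * 8 + [1] * 2)

(train_x, train_y), (val_x, val_y) = tail_split(X_demo, y_demo)
print("validation rows:", val_x.ravel())
print("validation labels:", val_y)

# Shuffle features and labels with the same permutation before splitting,
# so the validation slice is no longer tied to the original ordering.
rng = np.random.default_rng(7)
perm = rng.permutation(len(X_demo))
(_, _), (val_x_shuf, val_y_shuf) = tail_split(X_demo[perm], y_demo[perm])
print("validation rows after shuffling:", val_x_shuf.ravel())
```

If the rows of INPUTFILE.csv happen to be grouped by class, an unrepresentative validation slice like the first one here could by itself explain part of a large gap between training and validation accuracy.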

Getting good results on your training data and then bad or mediocre results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns the specifics of the training data so closely that it can't achieve good results on new data.
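
One rough way to make this check concrete is to compare the last training and validation accuracies recorded in history.history. This is only a sketch: the function name generalization_gap and the 0.1 threshold are illustrative assumptions, not a Keras API or an established rule, and the accuracy values are made-up numbers that mirror the 90% / 55% figures in the question.

```python
def generalization_gap(history_dict, train_key="acc", val_key="val_acc"):
    # Difference between the final training and validation accuracy.
    return history_dict[train_key][-1] - history_dict[val_key][-1]

# Made-up values shaped like the history.history dict returned by fit().
example_history = {
    "acc": [0.60, 0.75, 0.85, 0.90],
    "val_acc": [0.55, 0.57, 0.56, 0.55],
}

gap = generalization_gap(example_history)
THRESHOLD = 0.1  # arbitrary illustrative cut-off, not a standard value
if gap > THRESHOLD:
    verdict = "large train/validation gap: likely overfitting"
else:
    verdict = "train and validation accuracy are close"
print(round(gap, 2), verdict)
```

Depending on the Keras version, the keys may be 'accuracy' and 'val_accuracy' instead of 'acc' and 'val_acc'; printing history.history.keys() shows which ones apply.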

Evaluation means testing your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you may also want a third set (a validation set). If you keep adjusting your model to get better and better results on the test data, that is a bit like cheating: you are indirectly telling the model what the evaluation data looks like, which can cause overfitting to the test set.
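
A three-way split can be sketched as follows, assuming NumPy arrays and a hypothetical helper named three_way_split with arbitrary 60/20/20 proportions; the toy arrays stand in for the real data. The commented lines show how the pieces would then be passed to fit() through its validation_data argument and to evaluate(); they stay as comments because they need the compiled model from the question.

```python
import numpy as np

def three_way_split(x, y, val_frac=0.2, test_frac=0.2, seed=7):
    # Hypothetical helper: shuffle once, then cut into train / validation / test.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    n_val = int(len(x) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((x[train_idx], y[train_idx]),
            (x[val_idx], y[val_idx]),
            (x[test_idx], y[test_idx]))

# Toy stand-ins for the real features and labels (100 samples, 50 features).
X_demo = np.random.default_rng(0).normal(size=(100, 50))
y_demo = np.arange(100) % 2

(X_train, y_train), (X_val, y_val), (X_test, y_test) = three_way_split(X_demo, y_demo)
print(X_train.shape, X_val.shape, X_test.shape)

# With a compiled Keras model, the explicit sets replace validation_split:
# history = model.fit(X_train, y_train, batch_size=50, epochs=500,
#                     validation_data=(X_val, y_val))
# loss, acc = model.evaluate(X_test, y_test)  # held-out data, used once at the end
```

For a model compiled with metrics=['accuracy'], evaluate() returns the loss followed by the accuracy on whatever data it is given, so the numbers are only a fair test if that data was used neither for training nor for tuning.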

Also, if you want to split your data without relying on Keras, I recommend the sklearn train_test_split() function.

It's easy to use and looks like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
