What is the difference between x_test, x_train, y_test, y_train in sklearn?


Problem description


I'm learning sklearn and I don't fully understand the difference between these four variables, or why train_test_split returns four outputs.

I found some examples in the documentation, but they weren't enough to clear up my doubts.

Does the code use x_train to predict x_test, or x_train to predict y_test?

What is the difference between train and test? Do I use the train data to predict the test data, or something similar?

I'm very confused about this. Below is the example provided in the documentation.

>>> import numpy as np  
>>> from sklearn.model_selection import train_test_split  
>>> X, y = np.arange(10).reshape((5, 2)), range(5)  
>>> X
array([[0, 1], 
       [2, 3],  
       [4, 5],  
       [6, 7],  
       [8, 9]])  
>>> list(y)  
[0, 1, 2, 3, 4] 
>>> X_train, X_test, y_train, y_test = train_test_split(  
...     X, y, test_size=0.33, random_state=42)  
...  
>>> X_train  
array([[4, 5], 
       [0, 1],  
       [6, 7]])  
>>> y_train  
[2, 0, 3]  
>>> X_test  
array([[2, 3], 
       [8, 9]])  
>>> y_test  
[1, 4]  
>>> train_test_split(y, shuffle=False)  
[[0, 1, 2], [3, 4]]

Solution

Below is a dummy pandas.DataFrame to use as an example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})

Here we have 3 columns: X1, X2 and Y. Suppose X1 and X2 are your independent variables and the Y column is your dependent variable.

X = df[['X1','X2']]
y = df['Y']

With sklearn.model_selection.train_test_split you create 4 portions of data, which are used for fitting the model and for predicting values.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

X_train, X_test, y_train, y_test

Now:

1). X_train - This holds the independent variables for the rows used to train the model. Because we specified test_size = 0.4, 60% of the observations in your complete data are used to train/fit the model and the remaining 40% are used to test it.

2). X_test - This is the remaining 40% of the independent variables. It is not used in the training phase; the model makes predictions on it so that its accuracy can be tested.

3). y_train - This is the dependent variable the model learns to predict. It holds the category labels that correspond to the rows of X_train, and it has to be passed in alongside X_train when training/fitting the model.

4). y_test - This holds the category labels for your test data. These labels are compared against the model's predictions for X_test to measure accuracy.
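To make the four pieces concrete, here is a small sketch that rebuilds the same dummy DataFrame and inspects what train_test_split returns. The helper names rows_aligned and no_overlap are made up for this illustration; everything else comes from the code in this answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy data as in the answer
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4: 6 rows go to training, 4 rows go to testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Features and labels are split with the same row selection,
# so X_train lines up with y_train and X_test lines up with y_test
rows_aligned = (X_train.index.equals(y_train.index)
                and X_test.index.equals(y_test.index))

# No row ends up in both the training and the test portion
no_overlap = set(X_train.index).isdisjoint(X_test.index)

print(rows_aligned, no_overlap)
```

So the X/y distinction is "inputs versus labels", and the train/test distinction is "rows the model learns from versus rows held back for checking it".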

Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:

logreg = LogisticRegression()
logreg.fit(X_train, y_train) #This is where the training is taking place
y_pred_logreg = logreg.predict(X_test) #Making predictions to test the model on test data
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train)) #Train accuracy
#Logistic Regression Train accuracy 0.8333333333333334
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg)) #Test accuracy
#Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg)) #Confusion matrix
print(classification_report(y_test, y_pred_logreg)) #Classification Report
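To answer the original question directly: the model is fitted on X_train and y_train, it then predicts labels from X_test alone, and those predictions are compared with y_test. The sketch below spells that comparison out by hand, assuming the same dummy data as above; manual_accuracy, library_accuracy and score_accuracy are names chosen just for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dummy data and split as in the answer
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)      # training sees only X_train and y_train

# Prediction takes only X_test; y_test is never shown to the model
y_pred = logreg.predict(X_test)

# Accuracy is the fraction of test rows where prediction and true label agree
manual_accuracy = np.mean(y_pred == y_test.to_numpy())

# The library helpers compute the same quantity
library_accuracy = accuracy_score(y_test, y_pred)
score_accuracy = logreg.score(X_test, y_test)

print(manual_accuracy, library_accuracy, score_accuracy)
```

All three numbers should agree, since each one counts how often the predicted label matches the held-back label in y_test.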

You can read more about metrics here

Read more about data split here

Hope this helps:)
