My R-squared score is coming out negative but my accuracy score using k-fold cross-validation is about 92%


Question


For the code below, my R-squared score is coming out negative, but my accuracy score using k-fold cross-validation is coming out to about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is available at: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values   ##Independent variable
y = dataset.iloc[:,9].values     ##Dependent variable

##Encoding the categorical variables

le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()


##splitting the dataset in training and testing data

from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)

y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

Answer


There are several issues with your question...


For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting and the actual metric used underneath is the mean squared error (MSE).


Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
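As a minimal illustration (toy arrays, not taken from the HR dataset), accuracy only makes sense for discrete class labels, whereas a regressor's predictions are continuous and are judged with an error metric such as MSE:

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy is the fraction of labels predicted exactly right
y_true_cls = np.array([0, 1, 1, 0, 1])
y_pred_cls = np.array([0, 1, 0, 0, 1])
print(accuracy_score(y_true_cls, y_pred_cls))      # 0.8, i.e. 80% correctly classified

# Regression: predictions are continuous, so "percent correct" is undefined;
# the error is measured instead, e.g. with the mean squared error
y_true_reg = np.array([0.2, 1.4, 3.1])
y_pred_reg = np.array([0.3, 1.0, 3.5])
print(mean_squared_error(y_true_reg, y_pred_reg))  # 0.11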


The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)


MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
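For instance (made-up numbers, just to show the scale), MSE can easily exceed 1, so it clearly cannot be read as a percentage:

from sklearn.metrics import mean_squared_error
mean_squared_error([100, 200], [110, 230])
# 500.0 - a perfectly valid MSE, obviously not "50000%"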


With that in mind, it is good practice to explicitly include MSE as the scoring function of your cross-validation:

cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
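Note that scikit-learn negates the MSE here (scoring='neg_mean_squared_error') so that higher always means better; to read it back as an ordinary, positive MSE (or as an RMSE in the units of the target), just flip the sign:

import numpy as np
mse_cv = -cv_mse.mean()     # back to an ordinary (positive) MSE
rmse_cv = np.sqrt(mse_cv)   # RMSE, in the same units as the (scaled) target
print(mse_cv, rmse_cv)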


For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:

train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0


But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:

y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215


and you get a very nice R-squared in the test set (close to 1).

What about the MSE?

from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
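An equivalent way to sanity-check the test-set fit (a minimal sketch, assuming y_test_orig is an unscaled copy of the test targets kept aside before any scaling, and that sc_y remains fitted on y_train only, i.e. transform rather than fit_transform is used after the initial fit) is to map the predictions back to the original scale instead of scaling y_test:

# y_test_orig: hypothetical unscaled copy of the test targets, kept before scaling
y_pred_orig = sc_y.inverse_transform(y_pred.reshape(-1, 1))  # predictions back on the original 0/1 scale
r2_score(y_test_orig, y_pred_orig)
mean_squared_error(y_test_orig, y_pred_orig)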

