Cross-validation metrics in scikit-learn for each data split

Views: 84 Published: 2020/10/11 19:44:31 Tags: python scikit-learn cross-validation

Question

I need to get the cross-validation statistics explicitly for each split of the (X_test, y_test) data.

So I tried the following:

kf = KFold(n_splits=n_splits)

X_train_tmp = []
y_train_tmp = []
X_test_tmp = []
y_test_tmp = []
mae_train_cv_list = []
mae_test_cv_list = []

for train_index, test_index in kf.split(X_train):
    
    for i in range(len(train_index)):
        X_train_tmp.append(X_train[train_index[i]])
        y_train_tmp.append(y_train[train_index[i]])

    for i in range(len(test_index)):
        X_test_tmp.append(X_train[test_index[i]])
        y_test_tmp.append(y_train[test_index[i]])

    model.fit(X_train_tmp, y_train_tmp) # FIT the model = SVR, NN, etc.

    mae_train_cv_list.append( mean_absolute_error(y_train_tmp, model.predict(X_train_tmp)) ) # MAE of the train part of the KFold.

    mae_test_cv_list.append( mean_absolute_error(y_test_tmp, model.predict(X_test_tmp)) ) # MAE of the test part of the KFold.

    X_train_tmp = []
    y_train_tmp = []
    X_test_tmp = []
    y_test_tmp = []

Is it the proper way of getting the Mean Absolute Error (MAE) for each cross-validation split by using, for instance, KFold?

Recommended answer

There are some issues with your approach.

To start with, you certainly don't have to append the data manually one by one in your training & validation lists (i.e. your 2 inner for loops); simple indexing will do the job.
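As a minimal sketch of what "simple indexing" means here, assuming X_train and y_train are NumPy arrays (the toy data below is made up purely for illustration), the index arrays returned by kf.split can be used directly in place of the two inner loops:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, purely illustrative: 10 samples with 2 features each.
X_train = np.arange(20).reshape(10, 2)
y_train = np.arange(10)

kf = KFold(n_splits=5)

for train_index, val_index in kf.split(X_train):
    # Index arrays select whole rows at once; no per-element append needed.
    X_tr, y_tr = X_train[train_index], y_train[train_index]
    X_val, y_val = X_train[val_index], y_train[val_index]
    print(X_tr.shape, X_val.shape)
```

If the data is held in plain Python lists, converting it with np.asarray first makes this kind of indexing possible.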

Additionally, we normally never compute & report the error of the training CV folds - only the error on the validation folds.

Keeping these in mind, and switching the terminology to "validation" instead of "test", here is a simple reproducible example using the Boston data, which should be straightforward to adapt to your case:

from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

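# Note: load_boston was removed in scikit-learn 1.2; this call needs an older release.
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
# Note: criterion='mae' was renamed 'absolute_error' in scikit-learn 1.0.
model = DecisionTreeRegressor(criterion='mae')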

cv_mae = []

for train_index, val_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    pred = model.predict(X[val_index])
    err = mean_absolute_error(y[val_index], pred)
    cv_mae.append(err)

after which, your cv_mae should be something like (details will differ due to the random nature of CV):

[3.5294117647058827,
 3.3039603960396042,
 3.5306930693069307,
 2.6910891089108913,
 3.0663366336633664]

Of course, all this explicit stuff is not really necessary; you could do the job much more simply with cross_val_score. There is a small catch though:

from sklearn.model_selection import cross_val_score
cv_mae2 = cross_val_score(model, X, y, cv=n_splits, scoring="neg_mean_absolute_error")
cv_mae2
# result
array([-2.94019608, -3.71980198, -4.92673267, -4.5990099 , -4.22574257])

Apart from the negative sign, which is not really an issue, you'll notice that the variance of the results looks significantly higher than in our cv_mae above; the reason is that we didn't shuffle our data, since an integer cv gives unshuffled folds. cross_val_score has no shuffle argument of its own, so one way to handle this is to shuffle the data manually with shuffle beforehand. So our final code should be:

from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
X_s, y_s = shuffle(X, y)
cv_mae3 = cross_val_score(model, X_s, y_s, cv=n_splits, scoring="neg_mean_absolute_error")
cv_mae3
# result:
array([-3.24117647, -3.57029703, -3.10891089, -3.45940594, -2.78316832])

which shows significantly less variance between the folds and is much closer to our initial cv_mae.
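As an alternative to shuffling by hand, the cv parameter also accepts a splitter object, so a KFold with shuffle=True can be passed in directly. The sketch below additionally uses cross_validate with return_train_score=True, in case the per-split training error from the question is still wanted. The diabetes dataset, the default tree criterion, and the random_state values are assumptions made here for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Assumption: the bundled diabetes data is used as a stand-in dataset.
X, y = load_diabetes(return_X_y=True)
model = DecisionTreeRegressor(random_state=0)

# The splitter does the shuffling, so X and y stay in their original order.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

results = cross_validate(
    model, X, y,
    cv=kf,
    scoring="neg_mean_absolute_error",
    return_train_score=True,
)

# Scores are negated MAE values; flip the sign to get the MAE per split.
val_mae = -results["test_score"]
train_mae = -results["train_score"]
print(val_mae)
print(train_mae)
```

Fixing random_state on the splitter keeps the folds identical across runs, which makes it easier to compare models on the same splits.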
