Putting together sklearn pipeline+nested cross-validation for KNN regression

Problem description

I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that includes:

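  • normalizing the features
  • feature selection (the best subset of 20 numeric features, with no specific total)
  • cross-validating the hyperparameter K over the range 1 to 20
  • cross-validating the model
  • using RMSE as the error metric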

There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.

Besides sklearn.neighbors.KNeighborsRegressor, I think I need:

sklearn.pipeline.Pipeline  
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score

sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel

Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to the number of features
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10), 
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)

rmses = np.abs(scores)**(1/2)
avg_rmse = np.mean(rmses)
print(avg_rmse)

It doesn't seem to error out, but a few of my concerns are:

  • Did I perform the nested cross-validation properly so that my RMSE is unbiased?
  • If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
  • Is SelectKBest with f_classif the best option to use for selecting features for the KNeighborsRegressor model?
  • How can I see:
    • which subset of features was selected as best
    • which K was selected as best

Any help would be greatly appreciated!

Recommended answer

Your code seems okay.
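One detail worth double-checking against the "normalize the features" goal: Normalizer rescales each sample (row) to unit norm, whereas StandardScaler standardizes each feature (column) to zero mean and unit variance. The sketch below only illustrates that difference on made-up toy data; the array X_demo and the variable names are assumptions for illustration, not part of the original question.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data (made up): two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Normalizer works row-wise: every sample ends up with unit L2 norm
row_scaled = Normalizer().fit_transform(X_demo)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works column-wise: every feature gets mean 0 and std 1
col_scaled = StandardScaler().fit_transform(X_demo)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print("row norms after Normalizer:", row_norms)
print("column means/stds after StandardScaler:", col_means, col_stds)
```

For a distance-based model such as KNN, per-feature scaling is usually what "normalizing the features" refers to, so StandardScaler may be closer to the stated intent. Which one fits depends on the data.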

As for using scoring="neg_mean_squared_error" in both cross_val_score and GridSearchCV, I would do the same to make sure things run fine, but the only way to test this is to remove one of the two and see if the results change.
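To make the scoring convention concrete: "neg_mean_squared_error" returns the negated MSE (so that higher is better), which means an RMSE can be recovered as the square root of the negated score. The sketch below uses synthetic data from make_regression (an assumption, not the asker's data) and also compares against the "neg_root_mean_squared_error" scorer, which is available in scikit-learn 0.22 and later.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10.0, random_state=0)
model = KNeighborsRegressor(n_neighbors=5)

# Scores are negated MSE values, so they are all <= 0
neg_mse = cross_val_score(model, X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# Same folds, but scored directly as negated RMSE (scikit-learn >= 0.22)
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmse

print("per-fold RMSE:", rmse_from_mse)
print("mean RMSE:", rmse_from_mse.mean())
```

Writing np.sqrt(-scores) instead of np.abs(scores)**(1/2) also avoids depending on 1/2 being float division, which it is not under Python 2.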

SelectKBest is a good approach, but you can also use SelectFromModel or other feature selection methods available in scikit-learn.
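On the choice of score function: f_classif is an ANOVA F-test designed for categorical targets, while f_regression (and mutual_info_regression) in sklearn.feature_selection are intended for continuous targets such as the one a KNeighborsRegressor predicts. The following is a minimal sketch of SelectKBest with f_regression; the data is synthetic and constructed so that only columns 0 and 4 drive the target, which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 6 features, only columns 0 and 4 influence the target
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 4]
          + rng.normal(scale=0.1, size=200))

# Keep the 2 features with the highest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = selector.get_support(indices=True)

print("selected column indices:", selected)
print("F-scores:", np.round(selector.scores_, 2))
```

Swapping f_classif for f_regression inside the pipeline requires no other change, since both share the same score_func interface.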

Finally, in order to get the best parameters and the feature scores, I modified your code a bit as follows:

import ...  # same imports as in the question

pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and feature count from 1 to the number of features
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1, 21))}

# changes here: fit the grid search directly so its fitted attributes are available
grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# get the feature scores rounded to 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']

features_scores = ['%.2f' % elem for elem in pip_steps.scores_]
print("the features scores are \n {}".format(features_scores))

feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))

# build a list of (feature name, score, p-value) tuples for the selected features;
# featurelist must hold the names of all columns of X, in column order
featurelist = ['age', 'weight']

features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                           for i in pip_steps.get_support(indices=True)]

# sort the tuples by score, in descending order
features_selected_tuple = sorted(features_selected_tuple,
                                 key=lambda feature: float(feature[1]), reverse=True)

print('Selected Features, Scores, P-Values')
print(features_selected_tuple)
      

Results using my data:

the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=18, p=2,
         weights='uniform'))])

the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}

the features scores are
['8.98', '8.80']

the feature_pvalues is
['0.000', '0.000']

Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]
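Note that fitting GridSearchCV on all of X and y reports the parameters of a single final model, whereas the nested setup in the question fits a separate grid search inside every outer fold. If the per-fold choices are also of interest, one possible approach is cross_validate with return_estimator=True, which keeps the fitted GridSearchCV object for each outer fold. The sketch below is an assumption-laden illustration, not the answer's method: it uses synthetic data, a reduced parameter grid to keep it fast, and swaps in StandardScaler and f_regression in place of Normalizer and f_classif.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=90, n_features=4, n_informative=2,
                                 noise=5.0, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()),
                     ("kbest", SelectKBest(f_regression)),
                     ("regressor", KNeighborsRegressor())])
# Reduced grid (hypothetical values) so the example runs quickly
param_grid = {"kbest__k": [1, 2, 3, 4],
              "regressor__n_neighbors": [1, 3, 5, 7]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # estimates generalization error

grid = GridSearchCV(pipeline, param_grid, cv=inner_cv,
                    scoring="neg_mean_squared_error")
results = cross_validate(grid, X_demo, y_demo, cv=outer_cv,
                         scoring="neg_mean_squared_error",
                         return_estimator=True)

# One RMSE and one set of tuned parameters per outer fold
outer_rmses = np.sqrt(-results["test_score"])
fold_params = [est.best_params_ for est in results["estimator"]]
for rmse, params in zip(outer_rmses, fold_params):
    print("outer-fold RMSE %.2f with %s" % (rmse, params))
```

The outer-fold RMSEs estimate how well the whole tuning procedure generalizes. The parameters may differ from fold to fold, so the final model is typically obtained by refitting the grid search on all of the data, as in the code above.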
      
