Putting together sklearn pipeline + nested cross-validation for KNN regression
Question
I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that includes:
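- normalizing the features
- feature selection (best subset of 20 numeric features, no specific total)
- cross-validating the hyperparameter K in the range 1 to 20
- cross-validating the model
- using RMSE as the error metric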
There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.
Besides sklearn.neighbors.KNeighborsRegressor, I think I need:
sklearn.pipeline.Pipeline
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score
sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel
Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10),
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)

rmses = np.abs(scores)**(1/2)
avg_rmse = np.mean(rmses)
print(avg_rmse)
It doesn't seem to error out, but a few of my concerns are:
- Did I perform the nested cross-validation properly so that my RMSE is unbiased?
- If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
- Is SelectKBest, f_classif the best option to use for selecting features for the KNeighborsRegressor model?
- How can I see:
  - which subset of features was selected as best
  - which K was selected as best
Any help would be greatly appreciated!
Recommended answer
Your code seems okay.
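One detail may be worth double-checking against the stated goal of normalizing the features: Normalizer rescales each sample (row) to unit norm, whereas StandardScaler standardizes each feature (column) to zero mean and unit variance. The sketch below only illustrates that difference; the toy matrix X_toy is made up for the example and is not from the question.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy matrix (assumption, not from the question):
# 3 samples (rows), 2 features (columns) on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])

# Normalizer works row by row: every sample ends up with unit L2 norm.
row_normalized = Normalizer().fit_transform(X_toy)

# StandardScaler works column by column: every feature ends up with
# zero mean and unit variance.
col_standardized = StandardScaler().fit_transform(X_toy)

print("row norms after Normalizer:", np.linalg.norm(row_normalized, axis=1))
print("column means after StandardScaler:", col_standardized.mean(axis=0))
print("column stds after StandardScaler:", col_standardized.std(axis=0))
```

Since KNN relies on distances between samples, which of the two is appropriate depends on whether the features should contribute on comparable scales; swapping one for the other only changes the first pipeline step.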
For the scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV, I would do the same to make sure things run fine, but the only way to test this is to remove one of the two and see if the results change.
SelectKBest is a good approach, but you can also use SelectFromModel or one of the other feature selection methods that scikit-learn provides.
Finally, in order to get the best parameters and the feature scores, I modified your code a bit as follows:
# ... same imports as in the question ...

pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k': list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# changes here
grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# get the features scores rounded in 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']
features_scores = ['%.2f' % elem for elem in pip_steps.scores_]
print("the features scores are \n {}".format(features_scores))
feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))

# create a list of (feature name, score, p-value) tuples for the selected features,
# name it "features_selected_tuple"
featurelist = ['age', 'weight']
features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                           for i in pip_steps.get_support(indices=True)]

# sort the list by score, in reverse order
features_selected_tuple = sorted(features_selected_tuple, key=lambda feature: float(feature[1]), reverse=True)

# print (print() calls so the snippet also runs on Python 3)
print('Selected Features, Scores, P-Values')
print(features_selected_tuple)
Results using my data:
the best estimator is
 Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=18, p=2, weights='uniform'))])
the best parameters are
 {'kbest__k': 2, 'regressor__n_neighbors': 18}
the features scores are
 ['8.98', '8.80']
the feature_pvalues is
 ['0.000', '0.000']
Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]
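The single grid search above reports one best K and one feature subset, but it does not give the nested RMSE estimate the question asks about. A minimal sketch of how both could be obtained together is shown here, using cross_validate with return_estimator=True so that the fitted GridSearchCV from each outer fold can be inspected. Several things in it are assumptions rather than part of the original code: the data is synthetic (make_regression), the grid and fold counts are shrunk so it runs quickly, StandardScaler is swapped in for Normalizer, and f_regression is swapped in for f_classif, since f_classif is an ANOVA F-test intended for classification targets while f_regression is the univariate test meant for a continuous target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 120 samples, 6 numeric features.
X, y = make_regression(n_samples=120, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),            # per-feature standardization
    ("kbest", SelectKBest(f_regression)),   # univariate test for regression
    ("regressor", KNeighborsRegressor()),
])

# Reduced grid (assumption) to keep the example fast.
parameters = {
    "kbest__k": list(range(1, X.shape[1] + 1)),
    "regressor__n_neighbors": [1, 3, 5, 7],
}

# Inner folds tune the hyperparameters, outer folds score the tuned model.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

grid = GridSearchCV(pipeline, parameters, cv=inner_cv,
                    scoring="neg_mean_squared_error")
results = cross_validate(grid, X, y, cv=outer_cv,
                         scoring="neg_mean_squared_error",
                         return_estimator=True)

# Scores are negated MSE values, so flip the sign before taking the root.
fold_rmses = np.sqrt(-results["test_score"])
print("RMSE per outer fold:", fold_rmses)
print("mean RMSE:", fold_rmses.mean())

# Each outer fold has its own fitted GridSearchCV with its own winners.
for fitted_grid in results["estimator"]:
    mask = fitted_grid.best_estimator_.named_steps["kbest"].get_support()
    print(fitted_grid.best_params_, "selected columns:", np.flatnonzero(mask))
```

The chosen K and feature subset can differ between outer folds, which is expected: nested cross-validation estimates the error of the whole tuning procedure, not of one fixed parameter setting. To obtain a single final model, the grid search would then be fitted once on all the data, as in the answer's code. Writing np.sqrt(-scores) instead of np.abs(scores)**(1/2) also avoids the integer division of 1/2 under Python 2.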