How can I use a custom feature selection function in scikit-learn's `pipeline`?


Question

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by using the pipeline class.

For example, if I want to experiment with PCA vs LDA I could do something like:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
    ])

# Constructing the k-fold cross-validation iterator (k=10)

cv = KFold(n_splits=10,      # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]

But now, let's say that -- based on some "domain knowledge" -- I have the hypothesis that the features 3 & 4 might be "good features" (the third and fourth column of the array X_train) and I want to compare them with the other approaches.

How would I include such a manual feature selection in the pipeline?

For example:

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ]) 

This obviously won't work.

So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns the two columns of the numpy array? Or is there a better way?

Answer

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
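Such a class only needs a no-op `fit` and a `transform` that slices the columns. A minimal sketch (the `ColumnSelector` name and its `columns` parameter are made up for illustration; inheriting from `BaseEstimator` and `TransformerMixin` gives it `fit_transform` and `get_params` for free, so it plays well with `Pipeline` and grid search):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a fixed set of columns from a 2D feature array."""

    def __init__(self, columns):
        self.columns = columns  # e.g. [2, 3] for the third and fourth feature

    def fit(self, X, y=None):
        # Nothing to learn -- the columns are chosen by hand.
        return self

    def transform(self, X):
        return X[:, self.columns]
```

It can then be dropped into the pipeline like any other step, e.g. `('feature_select', ColumnSelector(columns=[2, 3]))`.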

Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.
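In more recent scikit-learn versions there is also `sklearn.preprocessing.FunctionTransformer`, which wraps a plain function like `select_3_and_4` into a transformer without the attribute-patching above. A sketch (the `clf_manual` name is made up; `X[:, 2:4]` picks out the question's features 3 and 4):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.naive_bayes import GaussianNB

clf_manual = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # Keep only the third and fourth columns, as select_3_and_4 did.
    ('feature_select', FunctionTransformer(lambda X: X[:, 2:4])),
    ('classification', GaussianNB())
    ])
```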

Data-driven feature selection tools are maybe off-topic, but always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression with e.g. k=2 in your case.
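As a sketch of that suggestion (the `clf_kbest` name is made up; `f_classif` suits a classification target like the `GaussianNB` setup above, `f_regression` a continuous one):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # Keep the 2 features with the highest ANOVA F-score w.r.t. y.
    ('feature_select', SelectKBest(score_func=f_classif, k=2)),
    ('classification', GaussianNB())
    ])
```

Unlike the manual `select_3_and_4`, the selected columns here are learned from the training folds, so the choice is re-made inside each cross-validation split.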

