How can I use a custom feature selection function in scikit-learn's `pipeline`?


Question

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by using the pipeline class.

For example, if I want to experiment with PCA vs LDA I could do something like:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
    ])

# Constructing the k-fold cross-validation iterator (k=10)

cv = KFold(n_splits=10,      # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]

But now, let's say that -- based on some "domain knowledge" -- I have the hypothesis that the features 3 & 4 might be "good features" (the third and fourth column of the array X_train) and I want to compare them with the other approaches.

How would I include such a manual feature selection in the pipeline?

For example:

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ]) 

This obviously won't work.

So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns the two columns of the numpy array? Or is there a better way?

Answer

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
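Such a class only needs a no-op `fit` and a `transform` that slices the columns. A minimal sketch (the `ColumnSelector` name and its `columns` parameter are made up for illustration; inheriting from `BaseEstimator` and `TransformerMixin` gives it `fit_transform` and `get_params` for free, so it plays well with `Pipeline` and grid search):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a fixed set of columns from a 2D feature array."""

    def __init__(self, columns):
        self.columns = columns  # e.g. [2, 3] for the third and fourth feature

    def fit(self, X, y=None):
        # Nothing to learn -- the columns are chosen by hand.
        return self

    def transform(self, X):
        return X[:, self.columns]
```

It can then be dropped into the pipeline like any other step, e.g. `('feature_select', ColumnSelector(columns=[2, 3]))`.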

Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.
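In more recent scikit-learn versions there is also `sklearn.preprocessing.FunctionTransformer`, which wraps a plain function like `select_3_and_4` into a transformer without the attribute-patching above. A sketch (the `clf_manual` name is made up; `X[:, 2:4]` picks out the question's features 3 and 4):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.naive_bayes import GaussianNB

clf_manual = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # Keep only the third and fourth columns, as select_3_and_4 did.
    ('feature_select', FunctionTransformer(lambda X: X[:, 2:4])),
    ('classification', GaussianNB())
    ])
```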

Data-driven feature selection tools are maybe off-topic, but always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression with e.g. k=2 in your case.
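As a sketch of that suggestion (the `clf_kbest` name is made up; `f_classif` suits a classification target like the `GaussianNB` setup above, `f_regression` a continuous one):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # Keep the 2 features with the highest ANOVA F-score w.r.t. y.
    ('feature_select', SelectKBest(score_func=f_classif, k=2)),
    ('classification', GaussianNB())
    ])
```

Unlike the manual `select_3_and_4`, the selected columns here are learned from the training folds, so the choice is re-made inside each cross-validation split.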

