How can I use a custom feature selection function in scikit-learn's `pipeline`?
Question
Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features, via cross-validation and by using the `Pipeline` class.
For example, if I want to experiment with PCA vs LDA, I could do something like:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

# Baseline: scale all features, then classify
clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
])

# Scale, project onto 2 principal components, then classify
clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
])

# Scale, project onto 2 linear discriminants, then classify
clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
])

# Constructing the k-fold cross validation iterator (k=10)
cv = KFold(n_splits=10,  # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
    for clf in [clf_all, clf_pca, clf_lda]
]
But now, let's say that, based on some "domain knowledge", I have the hypothesis that features 3 and 4 might be "good features" (the third and fourth columns of the array `X_train`), and I want to compare them with the other approaches.
How would I include such a manual feature selection in the `Pipeline`?
For example,
def select_3_and_4(X_train):
    return X_train[:, 2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),
    ('classification', GaussianNB())
])
obviously wouldn't work, since a plain function has neither a `fit` nor a `transform` method.
So I assume I have to create a feature selection class that has a dummy `fit` method and a `transform` method that returns the two columns of the `numpy` array? Or is there a better way?
Recommended answer
If you want to use the `Pipeline` object, then yes, the clean way is to write a transformer object. The dirty way to do this is
select_3_and_4.transform = select_3_and_4.__call__
# Pipeline calls fit(X, y), so fit must accept both arguments and return the transformer
select_3_and_4.fit = lambda X, y=None: select_3_and_4
and use `select_3_and_4` as you had it in your pipeline. You can evidently also write a class.
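As a rough illustration of the "clean way", here is a minimal sketch of such a transformer class. The name `ColumnSelector`, its `cols` parameter, and the toy arrays `X_demo` / `y_demo` are made up for this example and are not part of scikit-learn or of the question. The sketch also shows `sklearn.preprocessing.FunctionTransformer`, which in recent scikit-learn releases wraps a plain function as a pipeline step and may spare you the class entirely.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the columns listed in `cols`."""

    def __init__(self, cols=(2, 3)):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing is learned; the selection is fixed by `cols`.
        return self

    def transform(self, X):
        return np.asarray(X)[:, list(self.cols)]


def select_3_and_4(X):
    # Same function as in the question: third and fourth columns.
    return X[:, 2:4]


# Toy data standing in for X_train / y_train (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 5)
y_demo = rng.randint(0, 3, size=60)

# Variant 1: the custom transformer class.
clf_class = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', ColumnSelector(cols=(2, 3))),
    ('classification', GaussianNB()),
])

# Variant 2: wrap the plain function instead of writing a class.
clf_func = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', FunctionTransformer(select_3_and_4)),
    ('classification', GaussianNB()),
])

selected = ColumnSelector(cols=(2, 3)).fit_transform(X_demo)
clf_class.fit(X_demo, y_demo)
clf_func.fit(X_demo, y_demo)
```

Either pipeline can then be passed to `cross_val_score` alongside `clf_pca` and `clf_lda`. Inheriting from `BaseEstimator` gives the class `get_params` / `set_params`, which is what lets cross-validation clone the pipeline for each fold.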
Otherwise, you could also just give `X_train[:, 2:4]` to your pipeline if you know that the other features are irrelevant.
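A minimal sketch of that shortcut, assuming the scale-then-classify pipeline from the question and using made-up toy arrays (`X_demo`, `y_demo`) in place of the real `X_train` and `y_train`:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X_train / y_train (illustrative only).
rng = np.random.RandomState(123)
X_demo = rng.rand(100, 6)
y_demo = rng.randint(0, 2, size=100)

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=123)

# Slice the array before it ever reaches the pipeline.
scores_3_4 = cross_val_score(clf_all, X_demo[:, 2:4], y_demo,
                             cv=cv, scoring='accuracy')
```

Because `StandardScaler` works column by column, slicing before scaling should give the same features as scaling first and selecting afterwards. The drawback is that the selection lives outside the pipeline, so the same slice has to be applied by hand to any data passed to `predict` later.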
Data-driven feature selection tools are maybe off-topic, but always useful: check e.g. `sklearn.feature_selection.SelectKBest` using `sklearn.feature_selection.f_classif` or `sklearn.feature_selection.f_regression` with e.g. `k=2` in your case.
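For a classification problem, such a data-driven step might be wired into the pipeline as sketched below. The toy data is invented for illustration: the label is built from columns 2 and 3, so those are the columns one would expect the selector to favour, though that is not guaranteed in general.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): the label depends on columns 2 and 3.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)
y_demo = (X_demo[:, 2] + X_demo[:, 3] > 0).astype(int)

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # Keep the 2 features with the highest ANOVA F-score.
    ('feature_select', SelectKBest(score_func=f_classif, k=2)),
    ('classification', GaussianNB()),
])
clf_kbest.fit(X_demo, y_demo)

# Boolean mask over the input columns showing which ones were kept.
support = clf_kbest.named_steps['feature_select'].get_support()
```

Since the selector is fitted inside the pipeline, `cross_val_score` re-selects features on each training fold, which avoids leaking information from the held-out fold into the selection.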