How to decide the threshold value in SelectFromModel() for selecting features?

Question

I am using a random forest classifier for feature selection. I have 70 features in all and want to select the most important ones. The code below fits the classifier and prints the features from most to least important.

Code:

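import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of `data` except the first
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Sort feature indices from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))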

Now I am trying to use SelectFromModel from sklearn.feature_selection, but how do I decide the threshold value for my dataset?

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

When I set threshold=0.15 and then try to train my model, I get an error saying that the data is too noisy or the selection is too strict.

But with threshold=0.015 I am able to train my model on the newly selected features. So how should I decide this threshold value?

Recommended answer

I would try the following approach:

  1. Start with a low threshold, for example 1e-4.
  2. Reduce your features using SelectFromModel fit & transform.
  3. Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
  4. Increase the threshold and repeat from step 2.
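The loop described in these steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_classification (the asker's real dataset is not available), the threshold grid is arbitrary, and the helper name sweep_thresholds is hypothetical. Thresholds that leave no features are skipped, since that is the situation that triggers the "too strict" message.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (assumption): 70 features, 10 informative.
X, y = make_classification(n_samples=600, n_features=70, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)


def sweep_thresholds(thresholds):
    """Return (threshold, n_selected, validation accuracy) for each usable threshold."""
    results = []
    for t in thresholds:
        # Step 2: fit the selector and find which features pass the threshold.
        sfm = SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0), threshold=t)
        sfm.fit(X_train, y_train)
        mask = sfm.get_support()
        n_selected = int(mask.sum())
        if n_selected == 0:
            # Threshold too strict: nothing left to train on, so skip it.
            continue
        # Step 3: train on the selected features and score on held-out data.
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train[:, mask], y_train)
        results.append((t, n_selected, clf.score(X_val[:, mask], y_val)))
    return results


# Steps 1 and 4: start low and increase the threshold (arbitrary grid).
results = sweep_thresholds([1e-4, 1e-3, 5e-3, 1e-2, 2e-2, 5e-2, 0.15])
for t, n_selected, acc in results:
    print("threshold=%g  features=%d  accuracy=%.3f" % (t, n_selected, acc))

best = max(results, key=lambda r: r[2])
print("best threshold on this split:", best[0])
```

A single train/validation split gives a noisy estimate; cross-validation over the same grid would be a more robust variant of the same idea.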

With this approach you can estimate the best threshold for your particular data and estimator.
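It may also help to see why 0.15 is so strict here. A random forest's feature_importances_ are normalized to sum to 1, so with 70 features the average importance is about 1/70 ≈ 0.014, and a threshold of 0.15 asks a single feature to carry 15% of the total. SelectFromModel also accepts relative thresholds given as strings ("mean", "median", or a scaled form such as "1.25*mean") and a max_features cap. The sketch below uses synthetic data as a stand-in (an assumption, not the asker's dataset) to compare how many features each setting keeps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the asker's data (assumption): 70 features, 10 informative.
X, y = make_classification(n_samples=600, n_features=70, n_informative=10,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
# Importances sum to 1, so the mean importance is 1 / n_features.
print("sum=%.3f  mean=%.4f  max=%.4f"
      % (importances.sum(), importances.mean(), importances.max()))

# Relative thresholds adapt to the number of features automatically.
counts = {}
for t in ["median", "mean", "1.25*mean"]:
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0), threshold=t)
    sfm.fit(X, y)
    counts[t] = int(sfm.get_support().sum())
    print("threshold=%-10s value=%.4f  features kept=%d"
          % (t, sfm.threshold_, counts[t]))

# Alternatively, keep a fixed number of top features regardless of their scores.
top10 = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=-np.inf, max_features=10)
top10.fit(X, y)
n_top = int(top10.get_support().sum())
print("max_features=10 keeps", n_top, "features")
```

Whichever setting looks reasonable here should still be checked against a validation metric, as in the approach above.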
