PySpark & MLLib: Random Forest Feature Importances


Question

I'm trying to extract the feature importances of a random forest model I have trained using PySpark. However, I can't find an example of this anywhere in the documentation, and RandomForestModel has no method for it.

How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?

Here's the sample code provided in the documentation to get us started; it makes no mention of feature importances.

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

I don't see a model.__featureImportances_ attribute available. Where can I find it?

Recommended answer

UPDATE for version >= 2.0.0

As of version 2.0.0, as you can see here, featureImportances is available for Random Forest.

In fact, the documentation here states:

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.

Users can find more information about ensemble algorithms in the MLlib Ensemble guide. In this section, we demonstrate the DataFrame API for ensembles.

The main differences between this API and the original MLlib ensembles API are:

  • support for DataFrames and ML Pipelines
  • separation of classification vs. regression
  • use of DataFrame metadata to distinguish continuous and categorical features
  • more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.

If you want feature importance values, you have to work with the ml package, not mllib, and use DataFrames.

Below is an example, which you can find here:

# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier

# PREPARE DATA
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)

# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)

# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0}) 
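The vector returned by featureImportances is indexed by feature position, so it is often handy to pair it with column names. The helper below is a hypothetical sketch, not part of PySpark: rank_importances and the feature names are made up, and the hard-coded list stands in for model.featureImportances.toArray().

```python
def rank_importances(importances, feature_names):
    """Pair each importance with its feature name, highest first."""
    values = [float(v) for v in importances]
    if len(values) != len(feature_names):
        raise ValueError("expected one name per feature")
    pairs = zip(feature_names, values)
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Stand-in for model.featureImportances.toArray(); the values are made up.
example = [0.1, 0.7, 0.2]
ranked = rank_importances(example, ["age", "income", "tenure"])
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With a fitted model, the call would be rank_importances(model.featureImportances.toArray(), names), where names lists the input columns in the same order used to assemble the features vector.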
