PySpark & MLLib: Random Forest Feature Importances


Question

I'm trying to extract the feature importances of a random forest model that I trained using PySpark. However, I don't see an example of this anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?

Here is the sample code from the documentation to get us started; it makes no mention of feature importances.

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

I don't see a model.__featureImportances_ attribute available. Where can I find this?

Recommended Answer

Update for versions >= 2.0.0

As of version 2.0.0, feature importances (featureImportances) are available for Random Forest models.

In fact, the documentation states:

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.

Users can find more information about ensemble algorithms in the MLlib Ensemble guide. In this section, we demonstrate the DataFrame API for ensembles.

The main differences between this API and the original MLlib ensembles API are:

  • support for DataFrames and ML Pipelines
  • separation of classification vs. regression
  • use of DataFrame metadata to distinguish continuous and categorical features
  • more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.

If you want feature importance values, you have to work with the ml package, not mllib, and use DataFrames.
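
For the data used in the question, the move from mllib to ml might look roughly like the sketch below. It is not taken from the original answer: the function name train_rf_and_get_importances and its parameters are hypothetical, spark is assumed to be an existing SparkSession, and the file is assumed to be in LIBSVM format with labels 0 and 1.

```python
def train_rf_and_get_importances(spark, libsvm_path, num_trees=3, max_depth=4, seed=42):
    """Train a DataFrame-based random forest and return its feature importances.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession and
    `libsvm_path` a LIBSVM-format file such as the one used in the question.
    """
    # Imported lazily so this function can be defined without Spark installed.
    from pyspark.ml.classification import RandomForestClassifier

    # The "libsvm" data source yields a DataFrame with "label" and "features" columns.
    data = spark.read.format("libsvm").load(libsvm_path)
    training_data, _test_data = data.randomSplit([0.7, 0.3], seed=seed)

    # Same hyperparameters as the mllib call in the question.
    rf = RandomForestClassifier(
        labelCol="label",
        featuresCol="features",
        numTrees=num_trees,
        maxDepth=max_depth,
        maxBins=32,
        impurity="gini",
        featureSubsetStrategy="auto",
        seed=seed,
    )
    model = rf.fit(training_data)

    # featureImportances is a Vector with one entry per feature.
    return model.featureImportances
```

With a running session, a call such as train_rf_and_get_importances(spark, 'data/mllib/sample_libsvm_data.txt') should return the importance vector for the sample data set.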

Below is an example, taken from the documentation:

# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier

# PREPARE DATA
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)

# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)

# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0}) 
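
The importance vector is indexed by feature position, so it is often easier to read once paired with feature names and sorted. The helper below is a minimal sketch of that step and is not part of the original answer: rank_importances and the example feature names are hypothetical, and the input is assumed to be a plain sequence of floats, for instance the result of model.featureImportances.toArray().

```python
def rank_importances(importances, feature_names=None):
    """Pair each importance with a feature name and sort in descending order.

    `importances` is assumed to be a sequence of floats, e.g. the result of
    model.featureImportances.toArray(). Names default to feature_0, feature_1, ...
    """
    values = [float(v) for v in importances]
    if feature_names is None:
        feature_names = ["feature_%d" % i for i in range(len(values))]
    if len(feature_names) != len(values):
        raise ValueError("feature_names and importances must have the same length")
    # sorted() is stable, so ties keep their original feature order.
    return sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)


# Illustrative values only; these are not results from a trained model.
ranked = rank_importances([0.1, 0.0, 0.6, 0.3], ["age", "height", "income", "weight"])
for name, score in ranked:
    print("%-8s %.2f" % (name, score))
```

In a Spark session, the call would be rank_importances(model.featureImportances.toArray(), names), where names lists the input columns in the order they were assembled into the features vector.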
