Pyspark random forest feature importance mapping after column transformations

Question

I am trying to plot the feature importances of certain tree-based models, labelled with column names. I am using Pyspark.

Since I have both textual categorical variables and numeric ones, I had to use a pipeline, which looks something like this:

  1. use StringIndexer to index the string columns
  2. use a one-hot encoder for all columns
  3. use a VectorAssembler to create the feature column containing the feature vector

Some sample code from the docs for steps 1, 2, 3:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoderEstimator to convert categorical variables into binary SparseVectors
    # (in Spark 3.x this class is named OneHotEncoder)
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
               "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)

  • Finally, train the model

    After training and evaluation, I can use "model.featureImportances" to get the feature rankings. However, I don't get the feature/column names, just the feature numbers, something like this:

    print(dtModel_1.featureImportances)
    
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
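
For reference, the printed value is a sparse vector in the form (size, [indices], [values]): 38895 assembled feature slots, of which only the listed indices carry a non-zero importance. A minimal sketch in plain Python, using the numbers above as literals rather than a live model, pairs each index with its importance and ranks them:

```python
# Values copied from the printed featureImportances above; in a real session
# they would come from the model rather than being typed in by hand.
size = 38895
indices = [38708, 38714, 38719, 38720, 38737, 38870, 38894]
values = [0.0742343395738, 0.169404823667, 0.100485791055,
          0.0105823115814, 0.0134236162982, 0.194124862158, 0.437744255667]

# Pair each feature index with its importance, most important first.
ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)

for idx, importance in ranked:
    print(idx, round(importance, 4))
```

The indices are positions in the assembled feature vector, which is why they still need to be mapped to names.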
    

  • How do I map these back to the initial column names and values, so that I can plot them?

    Recommended answer

    Extract metadata as shown here by user6910411

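    from itertools import chain

    # Collect (index, name) pairs from the metadata of the assembled "features" column
    attrs = sorted(
        (attr["idx"], attr["name"]) for attr in (chain(*dataset
            .schema["features"]
            .metadata["ml_attr"]["attrs"].values())))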
    

    and combine with feature importances:

    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
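
Because one-hot encoding expands each categorical column into many vector slots, the names recovered this way are per category (for example, one slot for a single value of workclass) rather than the original column names. The sketch below shows one possible way to roll them back up to the source columns. It is plain Python with hypothetical stand-ins: `metadata` mimics the shape of `dataset.schema["features"].metadata`, `importances` stands in for `dtModel_1.featureImportances`, and the slot names assume the `<inputCol>_<category>` naming that VectorAssembler typically produces, which is worth confirming on your own data.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical stand-in for dataset.schema["features"].metadata.
metadata = {"ml_attr": {"attrs": {
    "binary": [
        {"idx": 0, "name": "workclassclassVec_Private"},
        {"idx": 1, "name": "workclassclassVec_Self-emp"},
        {"idx": 2, "name": "sexclassVec_Male"},
    ],
    "numeric": [
        {"idx": 3, "name": "age"},
        {"idx": 4, "name": "hours_per_week"},
    ],
}}}

# Hypothetical stand-in for dtModel_1.featureImportances (position -> importance).
importances = [0.10, 0.05, 0.0, 0.60, 0.25]

categorical_columns = ["workclass", "sex"]


def source_column(slot_name):
    """Map an assembled slot name back to the column it was derived from."""
    for col in categorical_columns:
        if slot_name.startswith(col + "classVec_"):
            return col
    return slot_name  # numeric columns keep their own name


# Same (index, name) extraction as in the answer, applied to the stand-in metadata.
attrs = sorted((a["idx"], a["name"])
               for a in chain(*metadata["ml_attr"]["attrs"].values()))

# Sum the importance of every slot that came from the same source column.
totals = defaultdict(float)
for idx, name in attrs:
    totals[source_column(name)] += importances[idx]

# Largest first, ready to hand to a bar chart.
by_column = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(by_column)
```

The resulting `by_column` pairs can be passed to any plotting library as the labels and lengths of a bar chart.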
    
