Getting labels from StringIndexer stages within pipeline in Spark (pyspark)


Problem description

I am using Spark (pyspark) and I have a pipeline set up with a bunch of StringIndexer objects that I use to encode the string columns to columns of indices:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
            for column in list(set(data_frame.columns) - ignore_columns)]
pipeline = Pipeline(stages=indexers)
new_data_frame = pipeline.fit(data_frame).transform(data_frame)

The problem is that I need to get the list of labels for each StringIndexer object after it gets fitted. For a single column and a single StringIndexer without a pipeline, it's an easy task: I can just access the labels attribute after fitting the indexer on the DataFrame:

indexer = StringIndexer(inputCol="name", outputCol="name_index")
indexer_fitted = indexer.fit(data_frame)
labels = indexer_fitted.labels
new_data_frame = indexer_fitted.transform(data_frame)

However, when I use the pipeline, this doesn't seem possible, or at least I don't know how to do it.

So I guess my question comes down to: is there a way to access the labels that were used during the indexing process for each individual column?

Or will I have to ditch the pipeline in this use case and, for example, loop through the list of StringIndexer objects and do it manually, as in the sketch below? (I'm sure that would be possible. However, using the pipeline would just be a lot nicer.)
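
A minimal sketch of that manual fallback, assuming the data_frame and ignore_columns variables from the question; each indexer is fitted and applied in turn so its labels can be collected along the way:

from pyspark.ml.feature import StringIndexer

labels_per_column = {}
new_data_frame = data_frame
for column in set(data_frame.columns) - ignore_columns:
    indexer = StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
    indexer_fitted = indexer.fit(new_data_frame)
    # labels are ordered by descending frequency by default
    labels_per_column[column] = indexer_fitted.labels
    new_data_frame = indexer_fitted.transform(new_data_frame)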

Recommended answer

Example data and Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, StringIndexerModel

df = spark.createDataFrame([("a", "foo"), ("b", "bar")], ("x1", "x2"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol=c, outputCol='{}_index'.format(c))
    for c in df.columns
])

model = pipeline.fit(df)

Extract from stages:

# Accessing _java_obj shouldn't be necessary in Spark 2.3+
{x._java_obj.getOutputCol(): x.labels 
for x in model.stages if isinstance(x, StringIndexerModel)}

{'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
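
As the comment above notes, on Spark 2.3+ the fitted StringIndexerModel exposes its params directly on the Python wrapper, so the same mapping can be built without _java_obj (a sketch assuming Spark 2.3 or later):

{x.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}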

From the metadata of the transformed DataFrame:

indexed = model.transform(df)

{c.name: c.metadata["ml_attr"]["vals"]
for c in indexed.schema.fields if c.name.endswith("_index")}

{'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
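
If the point of recovering the labels is to map indices back to the original strings, they can be handed to IndexToString; a hedged sketch using the x1 labels extracted above (IndexToString can also read the labels from the column metadata if the labels argument is omitted):

from pyspark.ml.feature import IndexToString

reverser = IndexToString(inputCol="x1_index", outputCol="x1_orig",
                         labels=['a', 'b'])
reverser.transform(indexed).select("x1", "x1_index", "x1_orig").show()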
