Apply StringIndexer to several columns in a PySpark Dataframe

Question

I have a PySpark dataframe

+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+

that I want to transform for use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:

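from pyspark.ml.feature import StringIndexer
# Fit the indexer on the name column; transform() then appends name_index
indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)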
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+

How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler (https://stackoverflow.com/questions/32606294/create-feature-vector-programmatically-in-spark-ml-pyspark) to generate a feature vector? Or do I have to create a StringIndexer for each column?

** EDIT **: This is not a duplicate, because I need to do this programmatically for several dataframes with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.

** EDIT 2**: A tentative solution is

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]

where I now create a list of dataframes, one per column, each identical to the original plus one transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.

Recommended answer

The best way that I've found to do it is to put several StringIndexer instances in a list and use a Pipeline to run them all:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One unfitted StringIndexer per column except date; pipeline.fit(df) below fits each stage
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(df.columns)-set(['date'])) ]

pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
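
The question also asks about producing a feature vector and about doing this for dataframes with different column names. A minimal sketch of one way to do both follows, assuming the indexed columns should all feed a single VectorAssembler stage at the end of the same Pipeline. The helper names `index_column_names` and `build_feature_pipeline`, the `exclude` parameter, and the output column name `features` are hypothetical and not part of the original answer.

```python
def index_column_names(columns, exclude=()):
    """Pair each column to index with its output name, preserving column order."""
    return [(c, c + "_index") for c in columns if c not in exclude]


def build_feature_pipeline(columns, exclude=("date",), features_col="features"):
    """Build an unfitted Pipeline: one StringIndexer per column, then a VectorAssembler."""
    # Imported here so the naming helper above can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    pairs = index_column_names(columns, exclude)
    # Unfitted estimators: Pipeline.fit fits each one in turn.
    indexers = [StringIndexer(inputCol=src, outputCol=dst) for src, dst in pairs]
    # The indexed columns are numeric, so VectorAssembler can combine them.
    assembler = VectorAssembler(
        inputCols=[dst for _, dst in pairs], outputCol=features_col
    )
    return Pipeline(stages=indexers + [assembler])


# Usage, assuming `df` is the dataframe from the question and a Spark session is active:
# model = build_feature_pipeline(df.columns).fit(df)
# df_features = model.transform(df)
```

Because the column list is passed in as an argument, the same function can be reused for dataframes with different column names. In Spark 3.0 and later, StringIndexer also accepts `inputCols` and `outputCols`, so a single indexer can cover several columns without a list of stages.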
