Apply StringIndexer to several columns in a PySpark Dataframe
Question
I have a PySpark dataframe
+-------+--------------+----+----+
|address| date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115| 20123192| Yen|gre |
+-------+--------------+----+----+
that I want to transform to use with pyspark.ml. I can use a StringIndexer to convert the name column to a numeric category:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address| date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin| 0.0|gre |
|1111111|20151122045501| Yin| 0.0|gre |
|1111111|20151122045500| Yln| 2.0|gra |
|1111112|20151122065832| Yun| 4.0|ddd |
|1111113|20160101003221| Yan| 3.0|fdf |
|1111111|20160703045231| Yin| 0.0|gre |
|1111114|20150419134543| Yin| 0.0|fdf |
|1111115|20151123174302| Yen| 1.0|ddd |
|2111115| 20123192| Yen| 1.0|gre |
+-------+--------------+----+----------+----+
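A quick mental model for the numbers above: by default, StringIndexer assigns 0.0 to the most frequent label, 1.0 to the next most frequent, and so on (the ordering of labels with equal counts is an implementation detail, so don't rely on it). A minimal pure-Python sketch of that logic, without Spark, using the name column from the dataframe above:

```python
from collections import Counter

def string_index(values):
    """Map each label to a float index, most frequent label first,
    mimicking StringIndexer's default frequency-descending ordering."""
    counts = Counter(values)
    # Sort labels by descending frequency; tie order is unspecified here,
    # just as it is an implementation detail in StringIndexer.
    ordered = sorted(counts, key=lambda lbl: -counts[lbl])
    mapping = {lbl: float(i) for i, lbl in enumerate(ordered)}
    return [mapping[v] for v in values]

names = ["Yin", "Yin", "Yln", "Yun", "Yan", "Yin", "Yin", "Yen", "Yen"]
print(string_index(names))  # Yin (most frequent) -> 0.0, Yen -> 1.0
```

This reproduces why Yin gets 0.0 (4 occurrences) and Yen gets 1.0 (2 occurrences) in the output above; the singleton labels Yln, Yun, and Yan land on 2.0–4.0 in an order that may differ from Spark's.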
How can I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate a feature vector? Or do I have to create a StringIndexer for each column?
**EDIT**: This is not a dupe because I need to do this programmatically for several data frames with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numerical.
**EDIT 2**: A tentative solution is
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]
where I create a list with three dataframes, each identical to the original plus the transformed column. Now I need to join them to form the final dataframe, but that's very inefficient.
Answer
The best way I've found to do it is to combine several StringIndexers in a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in list(set(df.columns)-set(['date'])) ]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address| date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin| 0.0| 0.0| 0.0|
|1111111|20151122045501| gra| Yin| 2.0| 0.0| 0.0|
|1111111|20151122045500| gre| Yln| 0.0| 2.0| 0.0|
|1111112|20151122065832| gre| Yun| 0.0| 4.0| 3.0|
|1111113|20160101003221| gre| Yan| 0.0| 3.0| 1.0|
|1111111|20160703045231| gre| Yin| 0.0| 0.0| 0.0|
|1111114|20150419134543| gre| Yin| 0.0| 0.0| 5.0|
|1111115|20151123174302| ddd| Yen| 1.0| 1.0| 2.0|
|2111115| 20123192| ddd| Yen| 1.0| 1.0| 4.0|
+-------+--------------+----+----+----------+----------+-------------+