How to encode string values into numeric values in a Spark DataFrame
Question
I have a DataFrame with two columns:
df =
Col1 Col2
aaa bbb
ccc aaa
I want to encode the string values as numeric values. I managed to do it this way:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val indexer1 = new StringIndexer()
  .setInputCol("Col1")
  .setOutputCol("Col1Index")
  .fit(df)
val indexer2 = new StringIndexer()
  .setInputCol("Col2")
  .setOutputCol("Col2Index")
  .fit(df)
val indexed1 = indexer1.transform(df)
val indexed2 = indexer2.transform(df)
val encoder1 = new OneHotEncoder()
  .setInputCol("Col1Index")
  .setOutputCol("Col1Vec")
val encoder2 = new OneHotEncoder()
  .setInputCol("Col2Index")
  .setOutputCol("Col2Vec")
val encoded1 = encoder1.transform(indexed1)
encoded1.show()
val encoded2 = encoder2.transform(indexed2)
encoded2.show()
The problem is that aaa is encoded differently in the two columns, because each column gets its own indexer.
How can I encode my DataFrame so that the same string maps to the same number in both columns, e.g.:
df_encoded =
Col1 Col2
1 2
3 1
Recommended answer
Fit a single StringIndexer on the values of both columns:
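import org.apache.spark.ml.feature.StringIndexer
// toDF needs spark.implicits._ (already imported in spark-shell).
val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("col1", "col2")
// Stack both columns into one column named "col" and fit one indexer on it.
val indexer = new StringIndexer().setInputCol("col").fit(
  df.select("col1").toDF("col").union(df.select("col2").toDF("col"))
)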
Then apply a copy of the fitted model to each column:
import org.apache.spark.ml.param.ParamMap
val result = Seq("col1", "col2").foldLeft(df) {
  (df, col) => indexer
    .copy(new ParamMap()
      .put(indexer.inputCol, col)
      .put(indexer.outputCol, s"${col}_idx"))
    .transform(df)
}
result.show
// +----+----+--------+--------+
// |col1|col2|col1_idx|col2_idx|
// +----+----+--------+--------+
// | aaa| bbb|     0.0|     1.0|
// | ccc| aaa|     2.0|     0.0|
// +----+----+--------+--------+
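To make the idea behind the shared indexer concrete, here is a minimal plain-Scala sketch (no Spark required) of the kind of mapping it learns. The object SharedIndexSketch and the helper fitLabels are hypothetical names used only for illustration. The ordering rule (most frequent label first, ties broken alphabetically) is an assumption about StringIndexer's default behaviour and may differ between Spark versions.

```scala
// Illustrative only: mimics the label-to-index mapping of one shared indexer.
object SharedIndexSketch {
  // Hypothetical helper. Assumed rule: the most frequent label gets index 0,
  // and labels with equal frequency are ordered alphabetically.
  def fitLabels(values: Seq[String]): Map[String, Double] =
    values
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .map { case (label, idx) => label -> idx.toDouble }
      .toMap

  def main(args: Array[String]): Unit = {
    val rows = Seq(("aaa", "bbb"), ("ccc", "aaa"))
    // Fit on the values of both columns together, like the union in the answer.
    val mapping = fitLabels(rows.map(_._1) ++ rows.map(_._2))
    // Apply the same mapping to each column, so "aaa" gets one index everywhere.
    val encoded = rows.map { case (c1, c2) => (mapping(c1), mapping(c2)) }
    encoded.foreach(println)
  }
}
```

Because both columns go through the same mapping, a given string can never receive two different indices, which is the property the question asks for.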