How to encode string values into numeric values in Spark DataFrame

Question

I have a DataFrame with two columns:

df = 
  Col1   Col2
  aaa    bbb
  ccc    aaa

I want to encode String values into numeric values. I managed to do it in this way:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer1 = new StringIndexer()
                    .setInputCol("Col1")
                    .setOutputCol("Col1Index")
                    .fit(df)

val indexer2 = new StringIndexer()
                    .setInputCol("Col2")
                    .setOutputCol("Col2Index")
                    .fit(df)

val indexed1 = indexer1.transform(df)
val indexed2 = indexer2.transform(df)

val encoder1 = new OneHotEncoder()
                    .setInputCol("Col1Index")
                    .setOutputCol("Col1Vec")

val encoder2 = new OneHotEncoder()
                    .setInputCol("Col2Index")
                    .setOutputCol("Col2Vec")

val encoded1 = encoder1.transform(indexed1)
encoded1.show()

val encoded2 = encoder2.transform(indexed2)
encoded2.show()

The problem is that aaa is encoded differently in the two columns. How can I encode the DataFrame so that the same string maps to the same number in both columns, e.g.:

df_encoded = 
   Col1   Col2
   1      2
   3      1

Recommended answer

Fit a single indexer on both columns:

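import org.apache.spark.ml.feature.StringIndexer
// toDF on a Seq needs spark.implicits._ (already imported in spark-shell)

val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("col1", "col2")

// Stack both columns into one column named "col" and fit a single indexer on it
val indexer = new StringIndexer().setInputCol("col").fit(
   df.select("col1").toDF("col").union(df.select("col2").toDF("col"))
)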

Then apply a copy of it to each column:

import org.apache.spark.ml.param.ParamMap

val result = Seq("col1", "col2").foldLeft(df){
  (df, col) => indexer
    .copy(new ParamMap()
      .put(indexer.inputCol, col)
      .put(indexer.outputCol, s"${col}_idx"))
    .transform(df)
}

result.show
// +----+----+--------+--------+
// |col1|col2|col1_idx|col2_idx|
// +----+----+--------+--------+
// | aaa| bbb|     0.0|     1.0|
// | ccc| aaa|     2.0|     0.0|
// +----+----+--------+--------+
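
To see why fitting on the union gives consistent codes, here is a minimal plain-Scala sketch (no Spark required) of the kind of mapping a single shared indexer learns. It assumes labels are ordered by descending frequency with ties broken alphabetically, which is my understanding of StringIndexer's default ordering in recent Spark versions. The names SharedIndexSketch, buildIndex and encode are hypothetical and exist only for this illustration.

```scala
// Plain-Scala illustration only; this is not Spark's implementation.
// Assumption: labels are ranked by descending frequency, ties broken alphabetically.
object SharedIndexSketch {
  // Build one label -> index map from the values of all columns pooled together.
  def buildIndex(values: Seq[String]): Map[String, Double] =
    values
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap

  // Encode every row with the same map, so equal strings get equal codes.
  def encode(rows: Seq[(String, String)]): Seq[(Double, Double)] = {
    val index = buildIndex(rows.flatMap { case (a, b) => Seq(a, b) })
    rows.map { case (a, b) => (index(a), index(b)) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("aaa", "bbb"), ("ccc", "aaa"))
    encode(rows).foreach(println)
  }
}
```

For the two sample rows this yields (0.0, 1.0) and (2.0, 0.0), the same values as col1_idx and col2_idx above. Because aaa occurs twice across the pooled columns it gets index 0.0 in both positions. If codes starting at 1 are wanted, as in the question, adding 1 to each index is enough.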
