Spark DataFrame handling empty String in OneHotEncoder


Problem Description

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applying the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this?

I can reproduce the error with the example provided on the Spark ml page:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),         //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.show()

It is annoying, since missing/empty values are a highly generic case.

Thanks in advance, Nikhil

Recommended Answer

The OneHotEncoder/OneHotEncoderEstimator does not accept an empty string as a name; otherwise you'll get the following error:

java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
  at scala.Predef$.require(Predef.scala:233)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
  at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
  [...]

This is how I would do it (there are other ways to do it, cf. @Anthony's answer):

I'll create a UDF to process the empty category:

import org.apache.spark.sql.functions._

def processMissingCategory = udf[String, String] { s => if (s == "") "NA"  else s }

Then, I'll apply the UDF on the column:

val df = sqlContext.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),         //<- original example has "a" here
   (4, "a"),
   (5, "c")
)).toDF("id", "category")
  .withColumn("category",processMissingCategory('category))

df.show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

Now, you can go back to your transformations:

import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|      NA|          3.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark +2.3
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
val encoded = encoder.transform(indexed)

encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
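As a side note, the all-zero vector (3,[],[]) for the NA row is not an error: by default the encoder uses dropLast=true, so the last category index (here 3.0, the NA category) is represented implicitly by all zeros. If you'd rather give NA an explicit slot, you can disable that behavior. A minimal sketch, assuming Spark 2.3+ and the same indexed DataFrame as above:

```scala
// Keep all categories instead of dropping the last one, so "NA"
// (categoryIndex 3.0) gets its own position in a 4-element vector.
val fullEncoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .setDropLast(false)
```

Whether that is desirable depends on the downstream model; dropping the last category avoids linearly dependent features in linear models.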

@Anthony 在 Scala 中的解决方案:


@Anthony's solution in Scala:

df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+
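If you'd rather avoid both a UDF and na.replace, the same substitution can also be written with the built-in when/otherwise expression, which keeps the logic inside Catalyst instead of a black-box UDF. A sketch, assuming the df defined earlier:

```scala
import org.apache.spark.sql.functions.when

// Replace empty strings in "category" with "NA" using a built-in
// conditional expression instead of a UDF.
val dfClean = df.withColumn("category",
  when(df("category") === "", "NA").otherwise(df("category")))
```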

I hope this helps!

