Spark treating null values in csv column as null datatype


Problem Description

My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting DataFrame to a different CSV file.

For example, I have an input CSV as follows:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       null as SecondaryLocationId
from Input

(I can't answer why null is being used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the datatype of SecondaryLocationId, reports null in the schema, and throws the error "CSV data source does not support null data type" while writing the output CSV.

Below are the printSchema() output and the write options I am using.

root
 |-- Id: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- PrimaryLocationId: string (nullable = false)
 |-- SecondaryLocationId: null (nullable = true)

dataFrame.repartition(1).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("delimiter", "|")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(outputPath)

Is there a way to default to a datatype (such as string)? By the way, I can get this to work by replacing null with an empty string (''), but that is not what I want to do.

Recommended Answer

Use lit(null):

Example:

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // for toDF on local Seqs

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

// typing the null as String still yields a null-type column:
val dfWithFoobar = df.withColumn("foobar", lit(null: String))


scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

and this null type is not retained by the CSV writer. If a concrete type is a hard requirement, you can cast the column to a specific type (let's say String):

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))
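
Since the transformation in the question is written in SQL, the same fix can be expressed in the query itself; this is a minimal sketch assuming the Input view from the question:

Select Id,
       FirstName,
       LastName,
       LocationId as PrimaryLocationId,
       cast(null as string) as SecondaryLocationId
from Input

Either way, printSchema() should then report SecondaryLocationId as string (nullable = true), and the CSV writer no longer rejects the column.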

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

(Reposting zero323's code.)

Now let's discuss your second question.

Question:

"只有当我知道哪些列将被视为空数据类型时才会这样做.当读取大量文件并对其应用各种转换时,我不知道或者有没有办法知道哪些字段是空处理的?"

"This is only when I know which columns will be treated as null datatype. When a large number of files are being read and applied various transformations on, then I wouldn't know or is there a way I might know which fields are null treated? "

Answer:

In this case you can use Scala's Option type.

The Databricks Scala style guide does not agree that null should always be banned from Scala code; it says: "For performance sensitive code, prefer null over Option, in order to avoid virtual method calls and boxing."
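
As an illustration of the Option route (a minimal sketch with hypothetical names, assuming spark.implicits._ is in scope): modeling the nullable field as Option[String] lets Spark infer a proper nullable string column, so nothing ends up as null type:

import spark.implicits._

case class Loc(id: Int, secondaryLocationId: Option[String])
val locDf = Seq(Loc(1, None), Loc(2, Some("234"))).toDF

locDf.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- secondaryLocationId: string (nullable = true)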

Example:

+------+
|number|
+------+
|     1|
|     8|
|    12|
|  null|
+------+


import org.apache.spark.sql.functions.{col, lit, udf, when}

// isEvenSimpleUdf is assumed by the snippet; a plain definition might be:
val isEvenSimpleUdf = udf[Boolean, Integer]((n: Integer) => n % 2 == 0)

val actualDf = sourceDf.withColumn(
  "is_even",
  when(
    col("number").isNotNull,
    isEvenSimpleUdf(col("number"))
  ).otherwise(lit(null)) // coerced to boolean here, not null type
)

actualDf.show()
+------+-------+
|number|is_even|
+------+-------+
|     1|  false|
|     8|   true|
|    12|   true|
|  null|   null|
+------+-------+
