Spark 2 数据集空值异常 [英] Spark 2 Dataset Null value exception

查看:42
本文介绍了Spark 2 数据集空值异常的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 spark Dataset.filter 中得到这个 null 错误

Getting this null error in spark Dataset.filter

输入 CSV:

name,age,stat
abc,22,m
xyz,,s

工作代码:

case class Person(name: String, age: Long, stat: String)

val peopleDS = spark.read.option("inferSchema","true")
  .option("header", "true").option("delimiter", ",")
  .csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()

失败代码(添加以下行返回错误):

val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()

返回空错误

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

推荐答案

你得到的异常应该解释一切,但让我们一步一步来:

Exception you get should explain everything but let's go step-by-step:

  • 使用csv数据源加载数据时,所有字段都标记为nullable:

  • When load data using csv data source all fields are marked as nullable:

val path: String = ???

val peopleDF = spark.read
  .option("inferSchema","true")
  .option("header", "true")
  .option("delimiter", ",")
  .csv(path)

peopleDF.printSchema

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)

  • 缺失字段表示为 SQL NULL

    peopleDF.where($"age".isNull).show
    

    +----+----+----+
    |name| age|stat|
    +----+----+----+
    | xyz|null|   s|
    +----+----+----+
    

  • 接下来将 Dataset[Row] 转换为 Dataset[Person],后者使用 Long 来编码 age字段.Scala 中的Long 不能为null.因为输入模式是 nullable,输出模式保持 nullable,尽管如此:

  • Next you convert Dataset[Row] to Dataset[Person] which uses Long to encode age field. Long in Scala cannot be null. Because input schema is nullable, output schema stays nullable despite of that:

    val peopleDS = peopleDF.as[Person]
    
    peopleDS.printSchema
    

    root
     |-- name: string (nullable = true)
     |-- age: integer (nullable = true)
     |-- stat: string (nullable = true)
    

    请注意,它 as[T] 根本不影响架构.

    Note that it as[T] doesn't affect the schema at all.

    当您使用 SQL(在注册表上)或 DataFrame API 查询 Dataset 时,Spark 不会反序列化对象.由于模式仍然是 nullable 我们可以执行:

    When you query Dataset using SQL (on registered table) or DataFrame API Spark won't deserialize the object. Since schema is still nullable we can execute:

    peopleDS.where($"age" > 30).show
    

    +----+---+----+
    |name|age|stat|
    +----+---+----+
    +----+---+----+
    

    没有任何问题.这只是一个简单的 SQL 逻辑,NULL 是一个有效值.

    without any issues. This is just a plain SQL logic and NULL is a valid value.

    当我们使用静态类型的Dataset API 时:

    When we use statically typed Dataset API:

    peopleDS.filter(_.age > 30)
    

    Spark 必须反序列化对象.因为 Long 不能是 null (SQL NULL),所以它失败了,你已经看到了异常.

    Spark has to deserialize the object. Because Long cannot be null (SQL NULL) it fails with exception you've seen.

    如果不是这样,你会得到 NPE.

    If it wasn't for that you'd get NPE.

    正确的数据静态类型表示应该使用 Optional 类型:

    Correct statically typed representation of your data should use Optional types:

    case class Person(name: String, age: Option[Long], stat: String)
    

    带调整过滤功能:

    peopleDS.filter(_.age.map(_ > 30).getOrElse(false))
    

    +----+---+----+
    |name|age|stat|
    +----+---+----+
    +----+---+----+
    

    如果您愿意,可以使用模式匹配:

    If you prefer you can use pattern matching:

    peopleDS.filter {
      case Some(age) => age > 30
      case _         => false     // or case None => false
    }
    

    请注意,您不必(但仍然建议)为 namestat 使用可选类型.因为 Scala String 只是一个 Java String,所以它可以是 null.当然,如果您采用这种方法,则必须明确检查访问的值是否为 null.

    Note that you don't have to (but it would be recommended anyway) to use optional types for name and stat. Because Scala String is just a Java String it can be null. Of course if you go with this approach you have to explicitly check if accessed values are null or not.

    相关Spark 2.0 Dataset vs DataFrame

    这篇关于Spark 2 数据集空值异常的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆