Create a DataFrame with null values in a few columns

Question

I am trying to create a DataFrame from an RDD.

First, I create the RDD with the code below:

val account = sc.parallelize(Seq(
                                 (1, null, 2,"F"), 
                                 (2, 2, 4, "F"),
                                 (3, 3, 6, "N"),
                                 (4,null,8,"F")))

This works fine:

account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at :27

However, when I try to create a DataFrame from the RDD with the code below:

account.toDF("ACCT_ID", "M_CD", "C_CD","IND")

I get this error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

I found that the error occurs only when I put a null value in the Seq.

Is there any way to add a null value?

Recommended answer

The problem is that Any is too general a type: Spark cannot derive a column schema for it, which is what the UnsupportedOperationException reports. Mixing null with Int values in the same tuple position makes Scala infer Any for that position. Provide a specific type explicitly instead; in your case, an integer type. Because null cannot be assigned to Scala's primitive Int, use java.lang.Integer. So try this:

// M_CD is typed as java.lang.Integer so that the column can hold null
val rdd = sc.parallelize(Seq(
                                 (1, null.asInstanceOf[Integer], 2, "F"),
                                 (2, Integer.valueOf(2), 4, "F"),
                                 (3, Integer.valueOf(3), 6, "N"),
                                 (4, null.asInstanceOf[Integer], 8, "F")))

Here is the output:

rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

And the corresponding DataFrame:

scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND")

df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]

scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
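
Another way to get a nullable integer column, not part of the answer above, is to model the value as Option[Int]. The following is a minimal sketch under a few assumptions: it targets Spark 2.x or later, it builds a local SparkSession (in spark-shell the first three lines can be skipped, since `spark` and its implicits already exist), and the names `optDf` and the app name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("nullable-column-sketch").getOrCreate()
import spark.implicits._

// None marks a missing M_CD value; Some(n) wraps a present one.
// The explicit element type keeps Scala from widening the column to Any.
val optDf = Seq[(Int, Option[Int], Int, String)](
  (1, None, 2, "F"),
  (2, Some(2), 4, "F"),
  (3, Some(3), 6, "N"),
  (4, None, 8, "F")
).toDF("ACCT_ID", "M_CD", "C_CD", "IND")

optDf.printSchema()
optDf.show()
```

With Option[Int], missing values are visible in the Scala types, so no null literal or cast is needed when building the rows.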

You can also consider a cleaner way to declare the null integer value, such as:

object Constants {
  val NullInteger: java.lang.Integer = null
}
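
As a rough illustration of how such a constant might be used, the sketch below builds the same four rows with it. The object is repeated so the snippet runs on its own; the local SparkSession setup and the name `constDf` are assumptions made for this example (in spark-shell, `spark` and its implicits are already available).

```scala
import org.apache.spark.sql.SparkSession

// Repeated from the answer so that this sketch is self-contained.
object Constants {
  val NullInteger: java.lang.Integer = null
}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("null-constant-sketch").getOrCreate()
import spark.implicits._

// Because NullInteger is declared as java.lang.Integer, the second tuple
// position is inferred as Integer rather than Any.
val constDf = Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, Constants.NullInteger, 8, "F")
).toDF("ACCT_ID", "M_CD", "C_CD", "IND")

constDf.show()
```

The typed constant replaces the repeated null.asInstanceOf[Integer] casts, which keeps the row literals easier to read.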
