Create a DataFrame with null values in a few columns
Problem description
I am trying to create a DataFrame from an RDD.
First, I create the RDD with the code below:
val account = sc.parallelize(Seq(
(1, null, 2,"F"),
(2, 2, 4, "F"),
(3, 3, 6, "N"),
(4,null,8,"F")))
This works fine:
account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at :27
But when I try to create a DataFrame from the RDD with the code below:
account.toDF("ACCT_ID", "M_CD", "C_CD","IND")
I get the following error:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
I found that the error occurs only when I put a null value in the Seq.
Is there any way to add a null value?
Recommended answer
The problem is that Any is too general a type, and Spark has no way to derive a schema for it or to serialize it, which is what the "Schema for type Any is not supported" message says. The compiler infers Any here because it is the only common supertype of null and Int. You should explicitly provide a specific type, in your case Integer. Since null can't be assigned to primitive types such as Int in Scala, you can use java.lang.Integer instead. So try this:
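// Named rdd to match the REPL output below.
// Integer.valueOf is used because the Integer constructor is deprecated in recent JDKs.
val rdd = sc.parallelize(Seq(
  (1, null.asInstanceOf[Integer], 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, null.asInstanceOf[Integer], 8, "F")))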
Here is the output:
rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
And the corresponding DataFrame:
scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]
scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
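To see why the original Seq ended up typed as Any, the inference can be reproduced without Spark. The following is a minimal sketch in plain Scala; the names InferenceDemo and elementClass are made up for illustration and belong neither to Spark nor to the original answer. It uses a ClassTag to report the element type the compiler inferred.

```scala
import scala.reflect.ClassTag

object InferenceDemo {
  // Hypothetical helper: returns the runtime class of the element type
  // that the compiler inferred for the sequence.
  def elementClass[A](xs: Seq[A])(implicit tag: ClassTag[A]): Class[_] =
    tag.runtimeClass

  // null (type Null) and 2 (type Int) share no supertype other than Any.
  val mixed = Seq(null, 2, 3)

  // Giving null a reference type lets the compiler settle on java.lang.Integer.
  val boxed = Seq(null.asInstanceOf[Integer], Integer.valueOf(2), Integer.valueOf(3))

  def main(args: Array[String]): Unit = {
    println(elementClass(mixed)) // class java.lang.Object, the erasure of Any
    println(elementClass(boxed)) // class java.lang.Integer
  }
}
```

The same widening happens inside the tuples of the question, which is why only the rows containing a bare null change the RDD's element type.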
You can also consider a cleaner way to declare the null integer value, such as:
object Constants {
  val NullInteger: java.lang.Integer = null
}
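For illustration, the constant could stand in for the casts roughly as follows. This sketch is plain Scala and repeats the Constants object so that it is self-contained; the name ConstantsDemo is an assumption made here, and the explicit type annotation on rows is added only so that the compiler rejects the Seq if any element were to widen to Any.

```scala
object Constants {
  val NullInteger: java.lang.Integer = null
}

object ConstantsDemo {
  // Same rows as in the answer, with the named constant
  // instead of null.asInstanceOf[Integer].
  val rows: Seq[(Int, java.lang.Integer, Int, String)] = Seq(
    (1, Constants.NullInteger, 2, "F"),
    (2, Integer.valueOf(2), 4, "F"),
    (3, Integer.valueOf(3), 6, "N"),
    (4, Constants.NullInteger, 8, "F"))

  def main(args: Array[String]): Unit =
    // Prints the ACCT_ID values whose M_CD is null.
    println(rows.filter(_._2 == null).map(_._1))
}
```

In spark-shell, passing a Seq like this to sc.parallelize and then calling toDF, as in the answer, should give the same nullable M_CD column.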