How to handle null values when writing to parquet from Spark
Question
Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regards to null column values today when writing dataframes out to parquet? I can only think of very ugly, horrible hacks like writing empty strings - and I have no idea what to do with numerical values to indicate null, short of putting in some sentinel value and having my code check for it (which is inconvenient and bug prone).
Answer
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.
The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
As per a comment by Michael Armbrust on that JIRA, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
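The same cast can be expressed through the DataFrame API rather than SQL. A minimal sketch (the column name `comments` simply mirrors the example above):

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DoubleType

// lit(null) produces an untyped null literal; casting it attaches
// the concrete type that Parquet needs in the schema
val df = spark.range(1).select(lit(null).cast(DoubleType).as("comments"))
df.printSchema
// root
//  |-- comments: double (nullable = true)
```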
and the result can be safely written to Parquet.
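To see the round trip end to end, here is a self-contained sketch (the local session, output path, and sample data are illustrative assumptions, not part of the original answer). An `Option[Double]` column naturally yields a nullable double schema, so the nulls survive the write and read unchanged:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; in spark-shell a session already exists
val spark = SparkSession.builder().master("local[*]").appName("null-demo").getOrCreate()
import spark.implicits._

// None becomes a SQL null in a typed (nullable double) column
val df = Seq(Some(1.0), None, Some(3.0)).toDF("comments")
df.write.mode("overwrite").parquet("/tmp/null-demo.parquet")

// Reading back preserves both the type and the null value
val back = spark.read.parquet("/tmp/null-demo.parquet")
back.printSchema
// root
//  |-- comments: double (nullable = true)
```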