How to handle null values when writing to parquet from Spark

Question

Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However it will be a long time before Spark supports that new Parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing today with regards to null column values when writing out dataframes to parquet? I can only think of very ugly, horrible hacks like writing empty strings, and... well... I have no idea what to do with numerical values to indicate null, short of putting some sentinel value in and having my code check for it (which is inconvenient and bug-prone).

Answer

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.
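
For instance, a column that already carries a numeric type will happily hold nulls. A minimal sketch (not part of the original answer), assuming a spark-shell session and a hypothetical output path:

import spark.implicits._

// Option[Double] inside the tuple becomes a nullable double column; None is written as NULL.
val df = Seq((1, Some(1.5)), (2, None), (3, Some(3.0))).toDF("id", "value")
df.printSchema()                                        // value: double (nullable = true)
df.write.mode("overwrite").parquet("/tmp/null_demo")    // hypothetical path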

The problem is that null alone carries no type information at all:

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per a comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
 |-- comments: double (nullable = true)

and the result can be safely written to Parquet.
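
A minimal sketch of the full round trip, again assuming a spark-shell session and a hypothetical output path: cast the untyped null, write it out, and read it back:

val fixed = spark.sql("SELECT CAST(null AS DOUBLE) AS comments")
fixed.write.mode("overwrite").parquet("/tmp/comments_parquet")   // hypothetical path

val back = spark.read.parquet("/tmp/comments_parquet")
back.printSchema()   // comments: double (nullable = true)
back.show()          // a single row containing null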
