Why does Spark output nullable = true when schema inference is left to Spark, in the case of JSON?

Question

Why does Spark show nullable = true when the schema is not specified and its inference is left to Spark?

// shows nullable = true for fields which are present in all JSON records.
spark.read.json("s3://s3path").printSchema() 

Going through the class JsonInferSchema, I can see that for StructType, nullable is explicitly set to true. But I am unable to understand the reason behind it.

PS: My aim is to infer the schema for a large JSON data set (< 100 GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job, as highlighted in the paper "Schema Inference for Massive JSON Datasets". One major part is that I want to know which fields are optional and which are mandatory (with respect to the data set).

Recommended answer

Because Spark may sample the data for schema inference, it cannot be 100% certain whether a field can be null or not, given the limited checking scope and sample size. Hence it is safer to mark every field as nullable (nullable = true). It is that simple.
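For the goal stated in the question (telling optional fields from mandatory ones), the inferred nullable flag will not help, but the data itself can be checked. The following is a minimal sketch, not a built-in Spark feature: it counts, per top-level column, how many records have a null or missing value. The object name OptionalFields and the two sample records are hypothetical; in practice the DataFrame would come from spark.read.json("s3://s3path"). The sketch assumes Spark 2.2 or later (for reading JSON from a Dataset[String]), treats a missing field and an explicit null as the same thing, and does not descend into nested structs. The samplingRatio option is the JSON reader's setting for how much of the input is used for inference; 1.0 means all of it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, when}

object OptionalFields {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("optional-fields")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample records standing in for the real data set.
    val records = Seq(
      """{"id": 1, "name": "a"}""",
      """{"id": 2}"""
    ).toDS()

    // samplingRatio = 1.0 asks the JSON reader to use every record for inference.
    val df = spark.read.option("samplingRatio", "1.0").json(records)

    // Every inferred field is still reported with nullable = true.
    df.printSchema()

    // One aggregation pass: for each column, count the rows where it is null.
    // Backticks guard against column names that contain dots.
    val nullCounts = df.select(
      df.columns.map(c => count(when(col(s"`$c`").isNull, lit(1))).alias(c)): _*
    ).first()

    df.columns.foreach { c =>
      val missing = nullCounts.getAs[Long](c)
      val kind = if (missing == 0L) "mandatory" else "optional"
      println(s"$c: $kind ($missing null or missing)")
    }

    spark.stop()
  }
}
```

A field is reported as mandatory only with respect to the records that were read, which matches the "with respect to the data set" framing in the question.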
