Why does Spark output nullable = true when schema inference is left to Spark, in the case of JSON?
Question
Why does Spark show nullable = true when the schema is not specified and its inference is left to Spark?
// Shows nullable = true even for fields that are present in every JSON record.
spark.read.json("s3://s3path").printSchema()
Going through the class JsonInferSchema, I can see that for StructType, nullable is explicitly set to true, but I am unable to understand the reason behind it.
PS: My aim is to infer the schema for a large JSON data set (< 100GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job, as highlighted in the paper Schema Inference for Massive JSON Datasets. One major part is that I want to know which fields are optional and which are mandatory (with respect to the data set).
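The "optional vs. mandatory" part of the question can be answered with a full scan rather than sampling: a field is mandatory only if it is present and non-null in every record. A minimal plain-Python sketch of that rule, using hypothetical inline records in place of the real S3 data (for 100GB you would express the same logic in Spark):

```python
import json

# Hypothetical sample of JSON-lines records (in practice these would be
# read from s3://s3path); "id" and "name" appear non-null in every record,
# while "note" is missing from one record and null in another.
records = [
    '{"id": 1, "name": "a", "note": "x"}',
    '{"id": 2, "name": "b"}',
    '{"id": 3, "name": "c", "note": null}',
]

parsed = [json.loads(r) for r in records]

# A field is mandatory iff it is present and non-null in every record;
# every other field seen at least once is optional.
all_fields = set().union(*(p.keys() for p in parsed))
mandatory = {f for f in all_fields
             if all(f in p and p[f] is not None for p in parsed)}
optional = all_fields - mandatory

print(sorted(mandatory))  # ['id', 'name']
print(sorted(optional))   # ['note']
```

The same per-field present-and-non-null test distributes naturally over partitions (an AND-reduce per field), which is why the paper frames it as a map-reduce job.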
Answer
Because Spark may sample the data for schema inference, it cannot infer with 100% certainty whether a field is nullable or not, given the limited checking scope of a sample. It is therefore safer to set nullable = true. It's that simple.
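A tiny plain-Python illustration of why a sample cannot prove non-nullability, using hypothetical data (the actual sampling inside Spark is more involved; this only shows the logical gap):

```python
# Hypothetical data set: "score" is null only in the last record, so any
# sample that misses that record would wrongly conclude it is non-nullable.
records = [{"score": i} for i in range(9)] + [{"score": None}]

sample = records[:5]  # a prefix sample, standing in for whatever a sampler takes
nullable_in_sample = any(r["score"] is None for r in sample)
nullable_in_full = any(r["score"] is None for r in records)

print(nullable_in_sample)  # False: the sample suggests non-nullable
print(nullable_in_full)    # True: only the full scan reveals the null
```

Since a null can always hide outside the sampled records, declaring nullable = true is the only safe default.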