How do I apply a schema with nullable = false when reading JSON?


Question

I'm trying to write some test cases for DataFrames using JSON files (production will use Parquet). I'm using the spark-testing-base framework, and I'm running into a snag when asserting that two DataFrames are equal: the schemas don't match, because the schema of the DataFrame read from JSON always has nullable = true.

I'd like to be able to apply a schema with nullable = false when reading the JSON.

I've written a small test case:

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite

class TestJSON extends FunSuite with DataFrameSuiteBase {

  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
         StructField("b", IntegerType, nullable = true))
  )
  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    assert(readJson.schema == expectedSchema)

  }
}

And I have a small test.json file containing:

{"a": 1, "b": 2}
{"a": 1}

This returns:

StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true)) did not equal StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
ScalaTestFailureLocation: TestJSON$$anonfun$1 at (TestJSON.scala:15)
Expected :StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
Actual   :StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true))

Am I applying the schema the correct way? I'm using Spark 2.2 and Scala 2.11.8.

Recommended answer

There is a workaround: rather than reading the JSON directly from the file path, read the file as an RDD of strings and pass that RDD to the reader. The schema is then applied as given. Here is the code:

val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

test("testJSON") {
  // Load the file as an RDD[String] instead of handing the path to json().
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  // Reading from the path directly loses nullable = false:
  // val readJson = spark.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}

The test case passes, and printSchema() outputs:

root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = true)
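Another way to get the same schema, sketched here as an assumption rather than as part of the original answer, is to keep the file-based read and then rebuild the DataFrame from its rows with the desired schema using SparkSession.createDataFrame(rowRDD, schema). The names ReapplySchemaSketch and withSchema are hypothetical. Note that this only changes the declared schema; it does not make the data non-null, so a null in column a may only surface as an error when the rows are evaluated.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object ReapplySchemaSketch {

  // Hypothetical helper: rebuild the DataFrame from its rows, declaring
  // the given schema (including its nullable flags) on the result.
  def withSchema(spark: SparkSession, df: DataFrame, schema: StructType): DataFrame =
    spark.createDataFrame(df.rdd, schema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reapply-schema-sketch")
      .getOrCreate()

    val expectedSchema = StructType(
      List(StructField("a", IntegerType, nullable = false),
           StructField("b", IntegerType, nullable = true))
    )

    // Reading from the path reports every column as nullable = true.
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    // Re-declare the schema on the same rows.
    val fixed = withSchema(spark, readJson, expectedSchema)
    fixed.printSchema()
    assert(fixed.schema == expectedSchema)

    spark.stop()
  }
}
```

The helper keeps the production-style read path untouched and applies the schema fix in one place, which may be convenient if several tests share it.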

There is an Apache Spark JIRA for this issue, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with the comment:

This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!

If you still hit the problem, you can reopen the JIRA. I tested with Spark 2.1.0 and still see the same issue.
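If the goal is only to make the test assertion pass, a different approach is to compare the two schemas while ignoring nullability. The sketch below is an assumption, not something from the question or the JIRA: ignoreNullability and SchemaCompareSketch are hypothetical names, and the helper handles only top-level fields, not nested structs, arrays, or maps.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object SchemaCompareSketch {

  // Hypothetical helper: mark every top-level field as nullable so two
  // schemas can be compared on field names and data types only.
  def ignoreNullability(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(nullable = true)))

  def main(args: Array[String]): Unit = {
    // The schema the test expects.
    val expected = StructType(
      List(StructField("a", IntegerType, nullable = false),
           StructField("b", IntegerType, nullable = true))
    )
    // The schema reported after reading JSON from a file path.
    val actual = StructType(
      List(StructField("a", IntegerType, nullable = true),
           StructField("b", IntegerType, nullable = true))
    )

    // A direct comparison fails because of the nullable flag on "a".
    assert(expected != actual)
    // After normalising nullability, the schemas match.
    assert(ignoreNullability(expected) == ignoreNullability(actual))
  }
}
```

This loosens the test, so it fits best when nullability itself is not what the test is meant to verify.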
