How do I apply a schema with nullable = false when reading JSON?
Question
I'm trying to write some test cases for DataFrames using JSON files (production will use Parquet). I'm using the spark-testing-base framework, and I'm running into a snag when asserting that two DataFrames are equal: the schemas do not match, because the schema of the DataFrame read from JSON always has nullable = true.
I'd like to be able to apply a schema with nullable = false to the JSON read.
I've written a small test case:
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.scalatest.FunSuite

class TestJSON extends FunSuite with DataFrameSuiteBase {
  val expectedSchema = StructType(
    List(StructField("a", IntegerType, nullable = false),
         StructField("b", IntegerType, nullable = true))
  )

  test("testJSON") {
    val readJson =
      spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    assert(readJson.schema == expectedSchema)
  }
}
along with a small test.json file:
{"a": 1, "b": 2}
{"a": 1}
This returns:
StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true)) did not equal StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
ScalaTestFailureLocation: TestJSON$$anonfun$1 at (TestJSON.scala:15)
Expected :StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,true))
Actual   :StructType(StructField(a,IntegerType,true), StructField(b,IntegerType,true))
Am I applying the schema the correct way? I'm using Spark 2.2 and Scala 2.11.8.
Recommended answer
There is a workaround: rather than reading the JSON directly from the file, read the file as an RDD of strings and then apply the schema to that RDD. The code is below:
val expectedSchema = StructType(
  List(StructField("a", IntegerType, nullable = false),
       StructField("b", IntegerType, nullable = true))
)

test("testJSON") {
  val jsonRdd = spark.sparkContext.textFile("src/test/resources/test.json")
  //val readJson = sparksession.read.schema(expectedSchema).json("src/test/resources/test.json")
  val readJson = spark.read.schema(expectedSchema).json(jsonRdd)
  readJson.printSchema()
  assert(readJson.schema == expectedSchema)
}
The test case passes, and the printed schema is:
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = true)
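Another way to end up with the same schema, offered here as a sketch rather than as part of the original answer, is to keep the file-based read and then rebuild the DataFrame from its underlying RDD with the desired schema using `createDataFrame`. The object name `NullableSchemaSketch`, the helper `withSchema`, and the local `SparkSession` setup are hypothetical names introduced for illustration. One assumption to keep in mind: Spark does not eagerly check the data against nullable = false, so this is only safe when the column really contains no nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical sketch: names below are illustrative, not from the original post.
object NullableSchemaSketch {
  val expectedSchema: StructType = StructType(
    List(
      StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true)
    )
  )

  // Re-wrap the rows of `df` under `schema`, so the nullable flags
  // come from `schema` instead of from the file-based reader.
  def withSchema(spark: SparkSession, df: DataFrame, schema: StructType): DataFrame =
    spark.createDataFrame(df.rdd, schema)

  def main(args: Array[String]): Unit = {
    // Assumed local session; a test suite would normally provide `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullable-schema-sketch")
      .getOrCreate()

    // File-based read: the nullable flags in the supplied schema are not preserved here.
    val fromFile = spark.read.schema(expectedSchema).json("src/test/resources/test.json")

    val fixed = withSchema(spark, fromFile, expectedSchema)
    fixed.printSchema()
    assert(fixed.schema == expectedSchema)

    spark.stop()
  }
}
```

Inside the test class from the question, the same idea reduces to a single line, `spark.createDataFrame(readJson.rdd, expectedSchema)`, since `DataFrameSuiteBase` already supplies `spark`.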
There is an Apache Spark JIRA for this issue, https://issues.apache.org/jira/browse/SPARK-10848, which was closed as not a problem with this comment:
This should be resolved in the latest file format refactoring in Spark 2.0. Please reopen it if you still hit the problem. Thanks!
If you are still getting the error, you can reopen the JIRA. I tested with Spark 2.1.0 and still see the same issue.
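If the only goal is to stop the test from failing on the nullable flags, a different route is to compare the two schemas after marking everything nullable on both sides. The sketch below is an assumption on my part, not something from the JIRA or from spark-testing-base; `SchemaCompareSketch`, `asNullable`, and `sameIgnoringNullability` are hypothetical names, and the code relies only on the public `StructType`, `ArrayType`, and `MapType` case classes.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, MapType, StructField, StructType}

// Hypothetical sketch: compares schemas while ignoring nullability.
object SchemaCompareSketch {
  // Recursively mark every field, array element, and map value as nullable.
  def asNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = asNullable(f.dataType), nullable = true)))
    case ArrayType(elementType, _) =>
      ArrayType(asNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
    case other => other
  }

  // True when the two schemas differ at most in their nullable flags.
  def sameIgnoringNullability(left: StructType, right: StructType): Boolean =
    asNullable(left) == asNullable(right)

  def main(args: Array[String]): Unit = {
    val expected = StructType(List(
      StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true)))
    val actual = StructType(List(
      StructField("a", IntegerType, nullable = true),
      StructField("b", IntegerType, nullable = true)))

    assert(expected != actual)                        // strict equality fails
    assert(sameIgnoringNullability(expected, actual)) // relaxed comparison passes
  }
}
```

This trades away the check on nullability itself, so it fits cases where the test cares about column names and types but not about the nullable flags.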