How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?

Problem Description

I have as input a set of files with one JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example:

{
  "id":1,
  "name":"some name",
  "problem_field": "{\"height\":180,\"weight\":80}"
}

As expected, using sqlContext.read.json creates a DataFrame with the 3 columns id, name and problem_field, where problem_field is a String.
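
For illustration, here is a minimal sketch (assuming a local SparkContext and the toy record above) of this default behaviour, where problem_field comes back as a flat string:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local context just for this sketch; in a real job sc would already exist.
val sc = new SparkContext(new SparkConf().setAppName("default-read").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Default read: the escaped value is kept as an opaque string column.
val raw = sqlContext.read.json(sc.parallelize(Seq(
  """{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}""")))

raw.printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- name: string (nullable = true)
//  |-- problem_field: string (nullable = true)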

I have no control over the input files and I'd prefer to solve this within Spark, so: is there any way I can get Spark to read that String field as JSON and infer its schema properly?

Note: the JSON above is just a toy example; in my case problem_field would contain a variable set of fields, and it would be great for Spark to infer these fields without me having to make any assumptions about which fields exist.

Recommended Answer

Would this be an acceptable solution?

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

// Strip the quotes around the embedded object and unescape its inner quotes,
// so each line becomes one plain nested JSON object before reading it.
val escapedJsons: RDD[String] = sc.parallelize(Seq("""{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = sqlContext.read.json(unescapedJsons)

dfJsons.printSchema()

// Output
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- problem_field: struct (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- weight: long (nullable = true)
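
The replace-based unescaping above works for the toy record, but it can mangle lines where other string values start with { or contain escaped quotes. As an alternative sketch (assuming Spark 2.2 or later, where SparkSession, JSON reads over Dataset[String] and from_json are available), the nested schema can be inferred from the problem_field values themselves and only that column re-parsed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

val spark = SparkSession.builder().appName("parse-escaped-json").master("local[*]").getOrCreate()
import spark.implicits._

// Read the lines as-is; problem_field is still a plain string at this point.
val df = spark.read.json(Seq(
  """{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}""").toDS())

// Let Spark infer the nested schema from the problem_field values themselves...
val nestedSchema = spark.read.json(df.select("problem_field").as[String]).schema

// ...then parse only that column, leaving the rest of the record untouched.
val parsed = df.withColumn("problem_field", from_json(col("problem_field"), nestedSchema))
parsed.printSchema()

Because the schema is inferred over all problem_field values, records with different nested fields are merged into one struct, so no assumptions about which fields exist are needed.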
