How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?


Question

I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example:

{
  "id": 1,
  "name": "some name",
  "problem_field": "{\"height\":180,\"weight\":80}"
}

Expectedly, when using sqlContext.read.json it will create a DataFrame with the 3 columns id, name and problem_field, where problem_field is a String.

I have no control over the input files and I'd prefer to solve this problem within Spark. Is there any way I can get Spark to read that String field as JSON and infer its schema properly?

Note: the JSON above is just a toy example; in my case problem_field could contain a varying set of fields, and it would be great for Spark to infer those fields so that I don't have to make any assumptions about which fields exist.

Answer

Would this be an acceptable solution?

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

// Sample input: problem_field is a JSON-escaped string inside the outer JSON object.
val escapedJsons: RDD[String] = sc.parallelize(Seq("""{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}"""))
// Unescape the inner JSON by stripping the wrapping quotes and the backslash escapes.
// Caveat: this plain string replacement is fragile if other string fields happen
// to contain "{, "} or escaped quotes.
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = sqlContext.read.json(unescapedJsons)

dfJsons.printSchema()

// Output
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- problem_field: struct (nullable = true)
|    |-- height: long (nullable = true)
|    |-- weight: long (nullable = true)
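A more robust alternative, sketched below under the assumption of Spark 2.2+ (where `spark.read.json` accepts a `Dataset[String]` and `from_json` is available): infer the inner schema by reading the `problem_field` column as its own JSON dataset, then parse the column with `from_json`. This avoids the fragile string replacement and, because inference scans all rows, also handles a varying set of inner fields:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().master("local[*]").appName("escaped-json").getOrCreate()
import spark.implicits._

val df = spark.read.json(Seq(
  """{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}"""
).toDS())

// Infer the schema of the inner JSON by reading the escaped-string column
// as a JSON dataset in its own right.
val innerSchema = spark.read.json(df.select("problem_field").as[String]).schema

// Replace the escaped string column with a parsed struct column.
val parsed = df.withColumn("problem_field", from_json($"problem_field", innerSchema))
parsed.printSchema()
```

The schema inference pass does cost an extra scan over the data; if the inner structure is known and stable, a hand-written `StructType` can be passed to `from_json` instead.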
