Parse JSON root in a column using Spark-Scala


Problem description

I'm having trouble transforming the root keys of a JSON document into records in a DataFrame, for an undetermined number of records.

I have a DataFrame generated from JSON similar to the following:

import spark.implicits._  // provides the Encoder needed for a Dataset[String]

val exampleJson = spark.createDataset(
  """
  {"ITEM1512":
        {"name":"Yin",
         "address":{"city":"Columbus",
                    "state":"Ohio"}},
   "ITEM1518":
        {"name":"Yang",
         "address":{"city":"Working",
                    "state":"Marc"}}
  }""" :: Nil)

When I read it with the following instruction:

val itemsExample = spark.read.json(exampleJson)

The schema and DataFrame generated are the following:

+-----------------------+-----------------------+
|ITEM1512               |ITEM1518               |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+

root
 |-- ITEM1512: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- ITEM1518: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)

But I want to generate something like this:

+-----------------------+-----------------------+
|Item                   |Values                 |
+-----------------------+-----------------------+
|ITEM1512               |[[Columbus, Ohio], Yin]|
|ITEM1518               |[[Working, Marc], Yang]|
+-----------------------+-----------------------+

So, in order to parse this JSON data, I need to read all the columns and add each one as a record in the DataFrame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a DataFrame.

I'm trying to replicate the solution found in How to parse the JSON data using Spark-Scala, with this code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

val columns: Array[String]       = itemsExample.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  // Explode each top-level column and keep its name alongside its value
  // (copied from the linked answer, where the columns were arrays).
  val temp = itemsExample.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("Item"),
      col("element.E").as("Value"))

  arrayOfDFs = arrayOfDFs :+ temp
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)

But I face the problem that, while in the example from the other question the root is an array, in my case the root is a StructType. Therefore the following exception is thrown:


org.apache.spark.sql.AnalysisException: cannot resolve 'explode(ITEM1512)' due to data type mismatch: input to function explode should be array or map type, not struct<address:struct<city:string,state:string>,name:string>
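
A quick check against the itemsExample DataFrame above (a minimal sketch, not part of the original question) confirms the mismatch the exception describes; every top-level column is a struct, which is exactly what explode rejects:

// Print the type of each top-level column; both are structs, not arrays
// or maps, so explode() cannot be applied to them.
itemsExample.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString}")
}
// ITEM1512: struct<address:struct<city:string,state:string>,name:string>
// ITEM1518: struct<address:struct<city:string,state:string>,name:string>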

Answer

You can use the stack function.

Example:

itemsExample
  .selectExpr("stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)")
  .show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
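
For context (an aside, not from the original answer): stack(n, key1, value1, key2, value2, ...) turns one input row into n output rows, pairing each literal label with its value. A toy sketch:

// stack(n, k1, v1, k2, v2, ...) emits n rows per input row.
spark.range(1)
  .selectExpr("stack(2, 'a', 10, 'b', 20) as (key, value)")
  .show()
//+---+-----+
//|key|value|
//+---+-----+
//|  a|   10|
//|  b|   20|
//+---+-----+

Note that all value expressions passed to stack must have compatible types; that holds here because every item column shares the same struct type.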


Update:

Dynamic stack query:

val stack = itemsExample.columns
  .map(x => s"'${x}',${x}")
  .mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)

itemsExample.selectExpr(stack).show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
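
To make this reusable, the dynamic expression can be wrapped in a small helper. This is a sketch rather than part of the original answer; the name unpivotColumns is made up, and it assumes all top-level columns share the same struct type, as stack requires:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: unpivot every top-level column of `df` into
// (Item, Values) rows using a generated stack() expression.
// Column names are backtick-quoted in case they contain special characters.
def unpivotColumns(df: DataFrame): DataFrame = {
  val pairs = df.columns.map(c => s"'$c',`$c`").mkString(",")
  df.selectExpr(s"stack(${df.columns.size},$pairs) as (Item,Values)")
}

unpivotColumns(itemsExample).show(false)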
